Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async p2p bugfix #165

Merged
merged 2 commits into from
Jun 7, 2022
Merged

Async p2p bugfix #165

merged 2 commits into from
Jun 7, 2022

Conversation

xuyao0127
Copy link
Collaborator

@xuyao0127 xuyao0127 commented Jun 7, 2022

This is a fix of issue #161. When sharing large data with asynchronous point-to-point communication, MPICH requires a handshake between the sender and receiver. If MPI_Irecv has not been called before checkpoint time, only the metadata of the message is shared among MPI processes and the data is still in the sender's local buffer. As a result, MPI_Iprobe in the p2p draining algorithm can detect the incoming message, but MPI_Irecv can't receive the message. In this PR, I also call MPI_Test on pending MPI_Isend requests to complete potential pending handshakes.

Copy link
Collaborator

@gc00 gc00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I'm approving it in advance, but please see the minor suggestions (mostly comments) to add.

int size = 0;
MPI_Type_size(call->datatype, &size);
int worldRank = localRankToGlobalRank(status.MPI_SOURCE,
call->comm);
call->comm);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is minor, but I find that the code is more readable when this argument call->comm is indented under the first argument (under status), as it was before.

UPDATE_REQUEST_MAP(request, MPI_REQUEST_NULL);
int flag = 0;
MPI_Status status;
MPI_Test_internal(&request, &flag, &status, false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment above this similar to your comment in this PR? This will help others to understand the code better. You can either insert your entire PR comment, which is very well written, or if you're searching for something shorter, perhaps:

// This is needed if an MPI_Isend was called earlier.  Without this,
// large messages within the same node will fail under MPICH and some
// other MPIs.  A previous call to MPI_Irecv caused only the metadata to be
// exchanged.  So, MPI_Iprobe succeeds and MPI_Irecv will later fail, unless
// we force the sending of data via MPI_Test. 

for (i = 0; i < g_world_size; i++) {
if (g_bytesSentToUsByRank[i] > g_recvBytesByRank[i]) {
return false;
}
}
// No pending MPI_Isend request
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a more expressive comment. Perhaps something like:

// recvMsgIntoInternalBuffer returns false if no MPI_Isend request is found:

@@ -228,8 +239,8 @@ drainSendRecv()
while (!allDrained()) {
sleep(1);
int numReceived = 0;
// If pending MPI_Irecvs, use MPI_Test in case msg was sent.
numReceived += completePendingIrecvs();
// If pending MPI_Irecvs or MPI_Isend, use MPI_Test to try to complete them.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pending MPI_Irecvs -> pending MPI_Irecv
(I know this is a previous comment. But let's polish it further now.)

to complete them -> to complete it ("A or B" is singular, and "A and B" is plural.)

#include <unistd.h>

#define MSG_SIZE 64*1024 // large message
#define MSG_SIZE 64 // small message
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#define MSG_SIZE 64*1024 // large message
#ifndef MSG_SIZE
# define MSG_SIZE 64 // small message
#endif
  1. Some compilers will complain about re-defining MSG_SIZE without this.
    The default version uses a large message in this version. But I think this is a good thing, since that's what caused the original bug.
  2. Should we raise the large message size to 256*1024, to make it more future-proof? In the future, maybe the optimized hardware considers even 64*1024 to be a small message that can use an EAGER policy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake. I was thinking of commenting out one of two definitions.

#include <stdlib.h>
#include <assert.h>
#include <unistd.h>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some comment like:

// For testing, checkpoint this during the 5-second sleep between
// MPI_Isend (rank 1) and MPI_Irecv (rank 0).

@xuyao0127 xuyao0127 merged commit 6fb119f into mpickpt:main Jun 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants