add progress messages to drm #270
Conversation
Signed-off-by: Danielle Sikich <sikich1@llnl.gov>
@adammoody This includes the latest updates with new condition checks for when the ireduce completes.
Thanks @dsikich. Let's look at rank 0 now for a bit. In the update, we have this code:
So if there is no outstanding bcast, it will start a bcast and a reduce. If there is an outstanding bcast, it will test both the bcast and the reduce. Let's also assume we have an outstanding reduce when we get to that part of the code, then four things can happen:
We need to fix case three (bcast completes, but reduce does not). It may be enough to check that both requests are NULL in the if condition. EDIT: Oh, case 4 also has a minor problem. Since we are updating current, future calls to update on rank 0 will return immediately until our timeout expires again. Only after the timeout expires will we start testing the bcast again, which means we'd have to call update twice before rank 0 kicks off a new bcast/reduce pair. We should think about how to fix that too.
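To make that concrete, here is a minimal C sketch of what rank 0's update path could look like with that fix: outstanding requests are tested on every call regardless of the timeout, and a new bcast/reduce pair is started only after both requests have completed (both back to MPI_REQUEST_NULL) and the timeout has expired. This is an illustration only, not the code in this PR; the function and variable names (progress_update_rank0, bcast_req, reduce_req, last_ts, timeout) are made up.

```c
#include <stdint.h>
#include "mpi.h"

/* Illustrative sketch only -- not the actual PR code.
 * Assumed names: bcast_req/reduce_req (outstanding requests),
 * last_ts (time the last pair completed), timeout (progress interval). */
static void progress_update_rank0(uint64_t* vals, uint64_t* totals, int count,
                                  MPI_Comm comm, MPI_Request* bcast_req,
                                  MPI_Request* reduce_req, double* last_ts,
                                  double timeout)
{
    if (*bcast_req == MPI_REQUEST_NULL && *reduce_req == MPI_REQUEST_NULL) {
        /* no outstanding collectives: start a new pair only after the timeout */
        if (MPI_Wtime() - *last_ts >= timeout) {
            int keep_going = 1;
            MPI_Ibcast(&keep_going, 1, MPI_INT, 0, comm, bcast_req);
            MPI_Ireduce(vals, totals, count, MPI_UINT64_T, MPI_SUM, 0, comm, reduce_req);
        }
        return;
    }

    /* test outstanding requests on every call, independent of the timeout */
    int flag;
    MPI_Test(bcast_req, &flag, MPI_STATUS_IGNORE);
    MPI_Test(reduce_req, &flag, MPI_STATUS_IGNORE);

    /* only when BOTH are done (both reset to MPI_REQUEST_NULL) do we print
     * and reset the timestamp; this covers the case where the bcast finishes
     * before the reduce */
    if (*bcast_req == MPI_REQUEST_NULL && *reduce_req == MPI_REQUEST_NULL) {
        /* ... print progress using totals ... */
        *last_ts = MPI_Wtime();
    }
}
```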
After that, take a close look at the complete code for rank 0 and think through rank 0 calling complete in all four cases:
I haven't done this yet, so I'm unsure whether it needs to be changed.
Found this in case it's useful: https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-1.1/node47.htm

"One is allowed to call MPI_WAIT with a null or inactive request argument. In this case the operation returns immediately with empty status. One is allowed to call MPI_TEST with a null or inactive request argument. In such a case the operation returns with flag = true and empty status."
@adammoody Thanks, that is good to know. In this case though, we were talking about how it might be better to have the checks in there so that it is obvious what state the requests are in? |
Yep, whichever way you'd like to do it is fine. If you do want to leave out some if checks, we can add comments to remind the people reading the code that things work even if the requests are NULL. |
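For anyone reading along, the behavior that passage describes is easy to see in a tiny standalone program; testing a null request is legal and simply returns flag = true, so any extra NULL checks in the drm code are there for readability rather than correctness. This example is illustrative, not taken from the PR:

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    /* testing a null request is legal: it returns immediately with flag = true */
    MPI_Request req = MPI_REQUEST_NULL;
    int flag = 0;
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    printf("flag = %d\n", flag);   /* prints 1 */

    MPI_Finalize();
    return 0;
}
```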
This text says that it's ok to test/wait on collectives out of order; they must only be started in order: https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node126.htm

"All completion calls (e.g., MPI_WAIT) described in Section Communication Completion are supported for nonblocking collective operations. Similarly to the blocking case, nonblocking collective operations are considered to be complete when the local part of the operation is finished, i.e., for the caller, the semantics of the operation are guaranteed and all buffers can be safely accessed and modified. Completion does not indicate that other processes have completed or even started the operation (unless otherwise implied by the description of the operation). Completion of a particular nonblocking collective operation also does not indicate completion of any other posted nonblocking collective (or send-receive) operations, whether they are posted before or after the completed operation.

Unlike point-to-point operations, nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a tag argument. All processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once a process calls a collective operation, all other processes in the communicator must eventually call the same collective operation, and no other collective operation with the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments."
@adammoody sounds like you were right about that then! |
Yeah, just wanted to double check. I helped write that text, so good that I remembered it :-) |
@adammoody looks like Travis is failing on MPI_Ibcast and MPI_Ireduce. It is using Open MPI from what I can see... does Open MPI support these non-blocking collectives? Maybe we need to update the version we are installing in Travis.
Open MPI should have those. Yes, we likely need to bump the Open MPI version that Travis is using.
Signed-off-by: Danielle Sikich <sikich1@llnl.gov>
Signed-off-by: Danielle Sikich <sikich1@llnl.gov>
Great work on the cleanup! The logic looks good to me after my first pass. Instead of passing in the comm_rank and comm_size, let's just make calls to MPI_Comm_rank and MPI_Comm_size from within the functions. That will simplify the interface. Also, let's pass in the final count value as input to the complete call, like you have in the update call. This will ensure our total adds up to the full final count.
@adammoody Ok, did one final pass with those updates. I would just do more testing. I've tested so far with two processes, but it should be tested more extensively before production use.
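For reference, a rough sketch of what the simplified complete interface could look like after that change: the rank is looked up internally and the caller's final counts are passed in, so the reduced total matches the real final value. The names here are illustrative, not the actual mpiFileUtils symbols, and the handling of any still-outstanding ibcast/ireduce discussed above is omitted:

```c
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

/* Illustrative sketch only: the caller passes its final counts so the
 * reduced total matches the real final value exactly. */
void prg_complete(uint64_t* final_vals, int count, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);   /* looked up here instead of passed in */

    uint64_t* totals = (uint64_t*) calloc((size_t)count, sizeof(uint64_t));

    /* a blocking reduce is fine here since every rank calls complete */
    MPI_Reduce(final_vals, totals, count, MPI_UINT64_T, MPI_SUM, 0, comm);

    if (rank == 0) {
        /* ... print the final progress message using totals ... */
    }

    free(totals);
}
```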
Working this into different functions now. Here's a sample of the output from dcp:
We can also print percent complete and estimated time left if we pre-compute the total amount of work to be done. That will take a few extra steps. |
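For what it's worth, the percent/ETA arithmetic is straightforward once the total is known up front; here is a hedged sketch (the helper name and message wording are made up, not the actual drm output):

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch: given a pre-computed total and elapsed seconds,
 * print percent complete and an estimated time remaining. */
static void print_percent_and_eta(uint64_t items_done, uint64_t total_items,
                                  double elapsed_secs)
{
    double percent = 0.0;
    if (total_items > 0) {
        percent = 100.0 * (double)items_done / (double)total_items;
    }

    double secs_remaining = 0.0;
    if (items_done > 0 && elapsed_secs > 0.0) {
        double rate = (double)items_done / elapsed_secs;            /* items per second */
        secs_remaining = (double)(total_items - items_done) / rate; /* at the current rate */
    }

    printf("Removed %llu of %llu items (%.1f%%), %.0f secs remaining\n",
           (unsigned long long)items_done, (unsigned long long)total_items,
           percent, secs_remaining);
}
```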
Sample messages from remove operation. This one has percent complete and estimated time remaining:
Thanks @dsikich! Found a couple of things to clean up when testing with longer timeouts that we didn't think about when whiteboarding the algorithm. Then did some code refactoring.
Use start, update, and complete functions to add progress messages to drm.
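At a high level, the usage pattern looks roughly like the sketch below: start once, call update from the work loop (it only prints after the timeout), and call complete at the end with the final counts. The prg_* and remove_one_item names are placeholders, not the actual mpiFileUtils functions.

```c
#include <stdint.h>
#include "mpi.h"

/* Placeholders for the progress calls and per-item work described above. */
void prg_start(MPI_Comm comm);
void prg_update(uint64_t* vals, int count, MPI_Comm comm);
void prg_complete(uint64_t* vals, int count, MPI_Comm comm);
void remove_one_item(uint64_t i);

static void remove_with_progress(uint64_t my_items, MPI_Comm comm)
{
    uint64_t count = 0;                      /* items this rank has removed so far */

    prg_start(comm);                         /* record start time, clear requests */
    for (uint64_t i = 0; i < my_items; i++) {
        remove_one_item(i);                  /* per-item work on this rank */
        count++;
        prg_update(&count, 1, comm);         /* may print a progress message */
    }
    prg_complete(&count, 1, comm);           /* drain collectives, print final totals */
}
```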