add progress messages to drm #270

Merged: 30 commits, May 13, 2019
Conversation

dsikich commented May 1, 2019

Use start, update, and complete functions to add progress messages to drm.

Signed-off-by: Danielle Sikich <sikich1@llnl.gov>
dsikich commented May 1, 2019

@adammoody This includes the latest updates, with new condition checks for when the ireduce completes.

adammoody commented May 1, 2019

Thanks @dsikich. Let's look at rank 0 now for a bit. In the update function, we have this code:

if (*req1 == MPI_REQUEST_NULL) {
  MPI_Ibcast(keep_going, 1, MPI_INT, 0, dupcomm, req1);
  MPI_Ireduce(values, global_vals, 2, MPI_INT, MPI_SUM, 0, dupcomm, req2);
} else {
  MPI_Test(req1, &done1, MPI_STATUS_IGNORE);
  MPI_Test(req2, &done2, MPI_STATUS_IGNORE);
  if (done2) {
    printf("items removed: %d\n", global_vals[0]);
    fflush(stdout);
    *current = time(NULL);
  }
}

So if there is no outstanding bcast, it will start a bcast and a reduce. If there is an outstanding bcast, it will test both the bcast and the reduce. Assuming we also have an outstanding reduce when we get to that part of the code, four things can happen:

  1. neither the bcast nor the reduce completes, so both tests fail, and we exit the call to come back later -- I think we're ok here

  2. both the bcast and the reduce complete, so both tests succeed, and we'll print a count and capture a new current time -- I think this is good too

  3. the bcast completes, but the reduce does not -- here we'll hit a problem if we call update again. Since there is no longer an outstanding bcast, we'll kick off a new bcast and a new reduce without having waited on our previous reduce to finish

  4. the reduce completes, but the bcast does not -- here we would print the message and grab the current time. Future calls to update would continue to test the bcast until it finishes. I think that case is ok as written: the bcast would eventually complete, then the next call to update would start a new bcast/reduce round, which is ok

We need to fix case three (bcast completes, but reduce does not). It may be enough to check that both requests are NULL in the if condition.

EDIT: Oh, case 4 also has a minor problem. Since we are updating current, future calls to update on rank 0 will return immediately until our timeout expires again. Only after the timeout expires will we start testing the bcast again, which means we'd have to call update twice before rank 0 kicks off a new bcast/reduce pair. We should think about how to fix that too.
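
A minimal sketch of the case-3 guard suggested above (only start a new round once both requests are NULL), reusing the names from the snippet earlier in this comment (req1, req2, keep_going, values, global_vals, dupcomm, current). This is just an illustration of the proposed check, not the merged code:

#include <mpi.h>
#include <stdio.h>
#include <time.h>

/* Sketch: only start a new bcast/reduce round once BOTH previous
 * requests have completed, which closes the hole in case 3. */
static void progress_update_rank0(int* keep_going, int* values, int* global_vals,
                                  MPI_Comm dupcomm, MPI_Request* req1,
                                  MPI_Request* req2, time_t* current)
{
  if (*req1 == MPI_REQUEST_NULL && *req2 == MPI_REQUEST_NULL) {
    /* nothing outstanding: safe to kick off the next round */
    MPI_Ibcast(keep_going, 1, MPI_INT, 0, dupcomm, req1);
    MPI_Ireduce(values, global_vals, 2, MPI_INT, MPI_SUM, 0, dupcomm, req2);
  } else {
    /* note whether the reduce was still active before testing, since
     * MPI_Test on MPI_REQUEST_NULL always returns with flag = true */
    int reduce_was_active = (*req2 != MPI_REQUEST_NULL);
    int done1, done2;
    MPI_Test(req1, &done1, MPI_STATUS_IGNORE);
    MPI_Test(req2, &done2, MPI_STATUS_IGNORE);
    if (reduce_was_active && done2) {
      /* the reduce just completed on this call: report and reset the timer */
      printf("items removed: %d\n", global_vals[0]);
      fflush(stdout);
      *current = time(NULL);
    }
  }
}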

adammoody commented May 1, 2019

After that, take a close look at the complete code for rank 0 and think through rank 0 calling complete in all four cases:

  1. outstanding bcast and outstanding reduce
  2. nothing outstanding
  3. outstanding bcast, but no reduce
  4. outstanding reduce, but no bcast

I haven't done this yet, so I'm unsure whether it needs to be changed.
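
One hedged way to collapse those four cases in complete on rank 0 is to drain whatever is still outstanding before posting the final round. A fragment just to illustrate the idea (again using req1/req2 from earlier, not the merged code):

/* MPI_Wait on MPI_REQUEST_NULL returns immediately, so rank 0 can
 * simply wait on both handles: cases 1-4 all collapse to "nothing
 * outstanding" before complete posts its final bcast/reduce. */
MPI_Wait(req1, MPI_STATUS_IGNORE);   /* finishes an outstanding bcast, if any  */
MPI_Wait(req2, MPI_STATUS_IGNORE);   /* finishes an outstanding reduce, if any */
/* now post the final round, e.g. broadcasting keep_going = 0 and
 * reducing the final counts, then wait on those as well */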

@adammoody

Found this in case it's useful:

https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-1.1/node47.htm

One is allowed to call MPI_WAIT with a null or inactive request argument. In this case the operation returns immediately with empty status.

One is allowed to call MPI_TEST with a null or inactive request argument. In such a case the operation returns with flag = true and empty status.
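
For reference, a minimal standalone program illustrating that behavior (a sketch, not from the MPI report):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  MPI_Request req = MPI_REQUEST_NULL;
  int flag = 0;
  MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* returns with flag == 1 and empty status */
  MPI_Wait(&req, MPI_STATUS_IGNORE);         /* also returns immediately */
  printf("flag = %d\n", flag);
  MPI_Finalize();
  return 0;
}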

dsikich commented May 1, 2019

@adammoody Thanks, that is good to know. In this case, though, we were talking about how it might be better to keep the checks in so that it is obvious what state the requests are in?

@adammoody

Yep, whichever way you'd like to do it is fine. If you do want to leave out some of the if checks, we can add comments to remind people reading the code that things work even if the requests are NULL.

@adammoody

This text says that it's ok to test/wait on collectives out of order; they only need to be started in order:

https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node126.htm

All completion calls (e.g., MPI_WAIT) described in Section Communication Completion are supported for nonblocking collective operations. Similarly to the blocking case, nonblocking collective operations are considered to be complete when the local part of the operation is finished, i.e., for the caller, the semantics of the operation are guaranteed and all buffers can be safely accessed and modified. Completion does not indicate that other processes have completed or even started the operation (unless otherwise implied by the description of the operation). Completion of a particular nonblocking collective operation also does not indicate completion of any other posted nonblocking collective (or send-receive) operations, whether they are posted before or after the completed operation.

Unlike point-to-point operations, nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a tag argument. All processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once a process calls a collective operation, all other processes in the communicator must eventually call the same collective operation, and no other collective operation with the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments.
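
In other words, the pattern in this PR is legal as long as every rank starts the collectives in the same order; the completion calls can then happen in any order. A small illustration with placeholder names:

/* both collectives are STARTED in the same order on every rank ... */
MPI_Ibcast(&keep_going, 1, MPI_INT, 0, comm, &req_bcast);
MPI_Ireduce(vals, sums, 2, MPI_INT, MPI_SUM, 0, comm, &req_reduce);

/* ... but it is fine to complete them out of order */
MPI_Wait(&req_reduce, MPI_STATUS_IGNORE);
MPI_Wait(&req_bcast, MPI_STATUS_IGNORE);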

dsikich commented May 1, 2019

@adammoody sounds like you were right about that then!

@adammoody

Yeah, just wanted to double check. I helped write that text, so good that I remembered it :-)

dsikich commented May 1, 2019

@adammoody Looks like Travis is failing on MPI_Ibcast and MPI_Ireduce. It is using Open MPI from what I can see... does Open MPI support these non-blocking collectives? Maybe we need to update the version we are installing in Travis.

@adammoody

Open MPI should have those. Yes, we likely need to bump the Open MPI version that Travis is using.

Danielle Sikich added 2 commits May 1, 2019 18:53
Signed-off-by: Danielle Sikich <sikich1@llnl.gov>
Signed-off-by: Danielle Sikich <sikich1@llnl.gov>
adammoody commented May 9, 2019

Great work on the cleanup! The logic looks good to me after my first pass.

Instead of passing in the comm_rank and comm_size, let's just make calls to MPI_Comm_rank and MPI_Comm_size from within the functions. That will simplify the interface.

Also, let's pass in the final count value as input to the complete call, like you have in the update call. This will ensure our total adds up to the full final count.
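
A sketch of what that simplified interface might look like; all names here are illustrative placeholders, not the actual functions in this PR:

#include <mpi.h>
#include <time.h>

typedef struct {
  MPI_Comm comm;      /* duplicated communicator used for progress messages */
  MPI_Request req1;   /* outstanding ibcast, if any */
  MPI_Request req2;   /* outstanding ireduce, if any */
  time_t last;        /* time of the last progress message */
} progress_t;

/* rank and size are looked up inside the call rather than passed in */
static void progress_update(progress_t* prg, int count)
{
  int rank, ranks;
  MPI_Comm_rank(prg->comm, &rank);
  MPI_Comm_size(prg->comm, &ranks);
  if (rank == 0) {
    /* rank 0 drives the ibcast/ireduce round described earlier */
  } else {
    /* other ranks contribute their local count to the reduce */
  }
}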

dsikich commented May 9, 2019

@adammoody Ok, did one final pass with those updates. I would just do more testing. I've tested so far with two processes, but it should be tested more extensively before production use.

@adammoody

Working this into the different functions now. Here's a sample of the output from dcp:

[2019-05-13T10:59:08] Copying data.
[2019-05-13T10:59:30] Copied 30.501 GB in 21.593206 secs (1.413 GB/s) ...
[2019-05-13T10:59:42] Copied 54.019 GB in 33.571640 secs (1.609 GB/s) ...
[2019-05-13T10:59:55] Copied 73.710 GB in 47.459370 secs (1.553 GB/s) ...
[2019-05-13T11:00:08] Copied 100.707 GB in 59.480167 secs (1.693 GB/s) ...
[2019-05-13T11:00:20] Copied 118.359 GB in 72.232474 secs (1.639 GB/s) ...
[2019-05-13T11:00:28] Copied 137.463 GB in 80.298619 secs (1.712 GB/s) ...
[2019-05-13T11:00:28] Copied 141.541 GB in 80.298805 secs (1.763 GB/s) done
[2019-05-13T11:00:28] Copied 141.541 GB in 80.298859 secs (1.763 GB/s) done
[2019-05-13T11:00:28] Copy data: 141.541 GB (151978344724 bytes)
[2019-05-13T11:00:28] Copy rate: 1.763 GB/s (151978344724 bytes in 80.298898 seconds)

We can also print percent complete and estimated time left if we pre-compute the total amount of work to be done. That will take a few extra steps.
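
For reference, the extra math is small once the total is known. A sketch with illustrative variable names (total items, items done so far, elapsed seconds), not the merged code:

double percent = 100.0 * (double)done / (double)total;
double rate    = (double)done / elapsed;                     /* items per second */
double remain  = (rate > 0.0) ? (double)(total - done) / rate : 0.0;
printf("%.2f%% complete %.0f secs remaining\n", percent, remain);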

adammoody commented May 13, 2019

Sample messages from the remove operation. This one has percent complete and estimated time remaining:

[2019-05-13T11:57:45] Walked 100001 items in 0.234298 seconds (426811.171786 files/sec)
[2019-05-13T11:57:45] Removing 100001 items
[2019-05-13T11:57:55] Removed 11351 of 100001 items in 10.005774 secs (1134.445023 items/sec) 11.35% complete 78 secs remaining...
[2019-05-13T11:58:05] Removed 23173 of 100001 items in 20.007553 secs (1158.212623 items/sec) 23.17% complete 66 secs remaining...
[2019-05-13T11:58:15] Removed 35128 of 100001 items in 30.009201 secs (1170.574326 items/sec) 35.13% complete 55 secs remaining...
[2019-05-13T11:58:25] Removed 46959 of 100001 items in 40.009633 secs (1173.692358 items/sec) 46.96% complete 45 secs remaining...
[2019-05-13T11:58:35] Removed 59668 of 100001 items in 50.010395 secs (1193.111951 items/sec) 59.67% complete 33 secs remaining...
[2019-05-13T11:58:45] Removed 72914 of 100001 items in 60.011518 secs (1215.000099 items/sec) 72.91% complete 22 secs remaining...
[2019-05-13T11:58:55] Removed 84876 of 100001 items in 70.012076 secs (1212.305139 items/sec) 84.88% complete 12 secs remaining...
[2019-05-13T11:59:05] Removed 96655 of 100001 items in 80.013420 secs (1207.984855 items/sec) 96.65% complete 2 secs remaining...
[2019-05-13T11:59:08] level=5 min=50000 max=50000 sum=100000 rate=1197.965291 secs=83.474873
[2019-05-13T11:59:08] level=4 min=0 max=1 sum=1 rate=68.652165 secs=0.014566
[2019-05-13T11:59:08] Removed 100001 of 100001 items in 83.489614 secs (1197.765748 items/sec) 100.00% complete
[2019-05-13T11:59:08] Removed 100001 of 100001 items in 83.489650 secs (1197.765225 items/sec) 100.00% complete
[2019-05-13T11:59:08] Removed 100001 items in 83.526148 seconds (1197.241856 items/sec)

adammoody left a comment

Thanks @dsikich! Found a couple of things to clean up when testing with longer timeouts that we didn't think about when whiteboarding the algorithm. Then did some code refactoring.
