@awlauria
Contributor

  • Send an ack back to the origin letting it know when it is
    safe to invoke the completion callback. Otherwise it
    can be called before the data has arrived.

    Found with test accfence1.c.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

@awlauria awlauria requested review from devreal and hjelmn March 25, 2021 16:05
@bosilca
Member

bosilca commented Mar 25, 2021

Can you give a few more details on what exactly this PR fixes? My understanding is that for MPI_Put and MPI_Accumulate, since we don't expect an answer from the target, the operations are not synchronizing, which means we only need to know about local completion.

@awlauria
Contributor Author

awlauria commented Mar 25, 2021

@bosilca that is right.

This addresses the fact that osc/rdma relies on the 'completion' callback from btl_put() to know when an operation is complete for synchronization (e.g., MPI_Win_fence). Because that callback only detected local completion in the btl/tcp layer, osc/rdma treated the operation as remotely complete when it was not, so the fence could return before the data had actually arrived.
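
The difference between local and remote completion described here can be sketched with a small self-contained simulation. This is not the btl/tcp code: the struct, the function names, and the in-memory "window" are all hypothetical, and only the two header-type values mirror the defines quoted later in the review. The point it illustrates is that the completion callback fires when the ack comes back, not when the bytes leave the origin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Header types: the two values mirror the defines quoted in the review;
 * everything else in this sketch is hypothetical. */
enum hdr_type { HDR_TYPE_PUT = 2, HDR_TYPE_PUT_ACK = 5 };

typedef void (*completion_fn)(void *ctx);

/* Hypothetical origin-side record for one outstanding put. */
struct pending_put {
    completion_fn cb;   /* upper-layer completion callback */
    void *ctx;
    bool sent;          /* bytes handed to the socket (local completion) */
    bool acked;         /* target confirmed the data landed (remote completion) */
};

static char target_window[16];   /* stands in for the exposed window memory */

static void count_completion(void *ctx) { ++*(int *)ctx; }

/* Origin: write the put to the wire. The callback is deliberately not
 * invoked here, because only local completion is known at this point. */
static void origin_send_put(struct pending_put *p) { p->sent = true; }

/* Target: copy the payload into the window, then reply with an ack. */
static enum hdr_type target_handle_put(const char *payload, size_t len)
{
    assert(len <= sizeof target_window);
    memcpy(target_window, payload, len);
    return HDR_TYPE_PUT_ACK;
}

/* Origin: a header arrived from the target; fire the callback once on ack. */
static void origin_handle_hdr(struct pending_put *p, enum hdr_type type)
{
    if (type == HDR_TYPE_PUT_ACK && p->sent && !p->acked) {
        p->acked = true;
        p->cb(p->ctx);
    }
}

/* Drive one put end to end. Stores the completion count observed before
 * the ack in *before_ack and returns the count observed after it. */
static int run_put_with_ack(int *before_ack)
{
    int completions = 0;
    struct pending_put put = { count_completion, &completions, false, false };
    const char data[] = "hello";

    origin_send_put(&put);
    *before_ack = completions;
    origin_handle_hdr(&put, target_handle_put(data, sizeof data));
    return completions;
}
```

Calling `run_put_with_ack()` from a `main` shows no completion after the local send and exactly one after the ack, which is the ordering a fence needs to rely on.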

@awlauria awlauria requested a review from bosilca March 25, 2021 18:03
Contributor

@devreal devreal left a comment

A few nitpicks. There are also some pretty long lines, but I guess clang-format will massage that code anyway...

#define MCA_BTL_TCP_HDR_TYPE_PUT 2
#define MCA_BTL_TCP_HDR_TYPE_GET 3
#define MCA_BTL_TCP_HDR_TYPE_FIN 4
#define MCA_BTL_TCP_HDR_TYPE_PUT_ACK 5
Contributor

While at it, can we make this an enum?

Contributor Author

Done.
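
For reference, the enum form of the four defines quoted above might look like the following. This is a sketch, not the code from the updated commit: the enum tag and the `hdr_type_name` helper are hypothetical, and only the four constants visible in the hunk are listed. The numeric values are kept identical to the defines on the assumption that they are carried in the message header and must not change.

```c
#include <assert.h>   /* static_assert (C11) */

/* Hypothetical enum tag; constant names and values match the quoted defines. */
enum mca_btl_tcp_hdr_type_sketch {
    MCA_BTL_TCP_HDR_TYPE_PUT     = 2,
    MCA_BTL_TCP_HDR_TYPE_GET     = 3,
    MCA_BTL_TCP_HDR_TYPE_FIN     = 4,
    MCA_BTL_TCP_HDR_TYPE_PUT_ACK = 5
};

/* Guard against accidental renumbering of the values. */
static_assert(MCA_BTL_TCP_HDR_TYPE_PUT == 2, "PUT value must stay 2");
static_assert(MCA_BTL_TCP_HDR_TYPE_PUT_ACK == 5, "PUT_ACK value must stay 5");

/* Hypothetical helper: with an enum, a switch like this lets the compiler
 * flag unhandled header types when a new one is added. */
static const char *hdr_type_name(enum mca_btl_tcp_hdr_type_sketch type)
{
    switch (type) {
    case MCA_BTL_TCP_HDR_TYPE_PUT:     return "PUT";
    case MCA_BTL_TCP_HDR_TYPE_GET:     return "GET";
    case MCA_BTL_TCP_HDR_TYPE_FIN:     return "FIN";
    case MCA_BTL_TCP_HDR_TYPE_PUT_ACK: return "PUT_ACK";
    }
    return "UNKNOWN";
}
```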

@bosilca
Member

bosilca commented Mar 26, 2021

> This addresses the fact that osc/rdma relies on the 'completion' callback from btl_put() to know when an operation is complete for synchronization (e.g., MPI_Win_fence). Because that callback only detected local completion in the btl/tcp layer, osc/rdma treated the operation as remotely complete when it was not, so the fence could return before the data had actually arrived.

But then the solution proposed here is very expensive, because it requires a sync for every single put and accumulate. This also requires progress, or the ack will not be extracted from the network. A more efficient approach would be to count the ops on the origin and on the target, and to have an explicit handshake for the match. This would allow all put/accumulate to remain as they are today (aka partially or locally completed) and force a single RTT on the synchronization.
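
The counting approach suggested here can be sketched as follows. This is only an illustration of the idea under simplifying assumptions (one origin, one target, no real network), and every name in it is hypothetical rather than taken from osc/rdma or btl/tcp. Puts stay locally completed, and a single handshake at synchronization time carries the origin's issued count so the target can reply once it has applied that many operations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-peer bookkeeping; none of these names come from Open MPI. */
struct origin_state { unsigned puts_issued; };
struct target_state { unsigned puts_applied; };

/* Origin: issue a put. Only local completion is tracked; no ack is awaited. */
static void origin_put(struct origin_state *o) { o->puts_issued++; }

/* Target: a put payload has been copied into the window. */
static void target_apply_put(struct target_state *t) { t->puts_applied++; }

/* Target side of the sync handshake: the origin sends its issued count,
 * and the target may answer only once it has applied that many puts. */
static bool target_sync_ready(const struct target_state *t, unsigned expected)
{
    return t->puts_applied >= expected;
}

/* Simulate one fence epoch with n puts, of which in_flight have not yet
 * been applied when the sync request reaches the target. Returns how many
 * late puts the target had to wait for before it could reply. */
static unsigned simulate_fence(unsigned n, unsigned in_flight)
{
    struct origin_state o = {0};
    struct target_state t = {0};
    unsigned waited = 0;

    assert(in_flight <= n);
    for (unsigned i = 0; i < n; i++) {
        origin_put(&o);
    }
    for (unsigned i = 0; i < n - in_flight; i++) {
        target_apply_put(&t);
    }

    /* One sync message carries o.puts_issued; late puts drain meanwhile. */
    while (!target_sync_ready(&t, o.puts_issued)) {
        target_apply_put(&t);   /* progress: one delayed put arrives */
        waited++;
    }
    return waited;   /* the target now sends its single reply */
}
```

In this model `simulate_fence(8, 3)` waits for three late puts and then answers once, so the cost is one round trip per synchronization rather than one per put.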

@awlauria
Contributor Author

@bosilca I agree your proposal would be better. This solution just makes the put/accumulate via btl/tcp 'correct', which it currently is not. TCP isn't expected to be performant anyway, but it should be correct. I can iterate on improving performance, but this change at least makes the rdma/tcp path viable for osc with Put/Accumulate.

@awlauria
Contributor Author

We can always go with this solution so it at least works for v5.0.x, and I can investigate implementing the better-performing path, but I suspect it will require additional code changes in osc/rdma.

@awlauria awlauria force-pushed the fix_btl_tcp_fence branch from 196ae07 to 254a215 Compare March 26, 2021 17:18
@awlauria
Contributor Author

awlauria commented Mar 26, 2021

@devreal changes addressed, thanks!

@bosilca
Member

bosilca commented Mar 26, 2021

I need to look more at the code to refresh my memory on why we do not have support for AM with acknowledgement. What I had in mind was something similar to the handling of synchronous sends in pt2pt.

@awlauria awlauria force-pushed the fix_btl_tcp_fence branch from 254a215 to 3d3727c Compare March 26, 2021 18:10
@open-mpi open-mpi deleted a comment from ibm-ompi Mar 26, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Mar 26, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Mar 26, 2021
@awlauria awlauria force-pushed the fix_btl_tcp_fence branch from 3d3727c to cc449df Compare March 26, 2021 19:07
@awlauria
Contributor Author

awlauria commented Apr 1, 2021

@bosilca a much simpler solution is to use the active messaging in btl/base. I made that change in #8756 and it works as well.

@awlauria awlauria closed this Apr 7, 2021
@awlauria awlauria deleted the fix_btl_tcp_fence branch March 17, 2022 17:27