Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hmem,efa: remove gdrcopy from cuda hmem copy path and make gdrcopy calls explicit #8836

Merged
merged 2 commits into from
Apr 26, 2023

Conversation

wenduwan
Copy link
Contributor

With the introduction of hmem_data MR field, the gdrcopy handle should be
stored along side device id, and gdrcopy should be requested explicitly
by the caller.

This patch:

  1. replaces calls to ofi_copy_from/to_hmem* functions with
    ofi_gdrcopy_from/to_cuda* equivalent when a gdrcopy handle is present.
  2. removes gdrcopy from cuda hmem copy path

With the new hmem_data MR field, device can only be device id. Therefore
it is no longer necessary to validate explicitly.

With this change, gdrcopy should be requested by the caller separately.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan
Copy link
Contributor Author

bot:aws:retest

@wenduwan wenduwan marked this pull request as ready for review April 24, 2023 13:44
@wenduwan wenduwan requested a review from a team April 24, 2023 13:45
prov/efa/src/efa_hmem.c Show resolved Hide resolved
Comment on lines 455 to 465
if (rxe->bytes_copied + data_size == rxe->total_len) {
ofi_copy_to_hmem_iov(desc->peer.iface, desc->peer.device.reserved,
rxe->iov, rxe->iov_count,
data_offset + ep->msg_prefix_size,
data, data_size);
ofi_gdrcopy_to_cuda_iov((uint64_t)desc->peer.hmem_data,
rxe->iov, rxe->iov_count,
data_offset + ep->msg_prefix_size,
data, data_size);
rxr_pkt_handle_data_copied(ep, pkt_entry, data_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there were 5 tabs followed by a few spaces, now you have 3 tabs followed by lots of spaces. Should re-tabbify.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked in my editor - I find it more reliable to achieve "github correct" indent with explicit spaces in the case of multi-line function calls, for my configuration.

Other places, though rendering correctly on github, show up crooked in my editor.

I'm unlikely alone since I see similar indent format choices.

@wenduwan wenduwan requested a review from a team April 24, 2023 19:16
prov/efa/src/efa_hmem.h Show resolved Hide resolved
prov/efa/src/efa_hmem.h Show resolved Hide resolved
prov/efa/src/efa_hmem.h Outdated Show resolved Hide resolved
prov/efa/src/rdm/rxr_pkt_type_base.c Show resolved Hide resolved
@wenduwan wenduwan force-pushed the efa_hmem_copy branch 3 times, most recently from d6c3125 to cfcf77f Compare April 24, 2023 20:39
With the introduction of hmem_data MR field, the gdrcopy handle should be
stored along side device id, and gdrcopy should be requested explicitly
by the caller.

This patch replaces calls to ofi_copy_from/to_hmem* functions with
ofi_gdrcopy_from/to_cuda* equivalent when a gdrcopy handle is present.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan
Copy link
Contributor Author

bot:aws:retest

1 similar comment
@wzamazon
Copy link
Contributor

bot:aws:retest

@wzamazon wzamazon requested a review from shefty April 25, 2023 13:04
@wenduwan
Copy link
Contributor Author

@shefty Could you kindly check the CI failure? Is it related to this change?

@shefty
Copy link
Member

shefty commented Apr 25, 2023

There were a bunch of socket failures, so I re-ran the CI. The latest failure is from our friend:

fi_rdm_tagged_peek -p "sockets"
10:42:26    timestamp: 20230425-174220+0000
10:42:26    result: Fail

@wenduwan
Copy link
Contributor Author

tcp failure

- name:   rdm_tagged_peek -p tcp
  timestamp: Mon 04/24/2023 21:15:54.16
  result: Fail
  time:   183
  server_cmd: C:\projects\libfabric\fabtests\x64\Release-v142\rdm_tagged_peek -p tcp  -s 127.0.0.1
  server_stdout:
[error] fabtests:common\shared.c:2310: Tag mismatch!. Expected: 2748, actual: 1
fi_cq_read/fi_cq_sread(): functional\rdm_tagged_peek.c:107, ret=-256 (Unspecified error)
Receive sync(): functional\rdm_tagged_peek.c:334, ret=-256 (Unspecified error)
Sending 10 tagged messages
Waiting for messages to complete
  client_cmd: C:\projects\libfabric\fabtests\x64\Release-v142\rdm_tagged_peek -p tcp  -s 127.0.0.1 127.0.0.1
  client_stdout:

This should not be related though since we are not touching tcp path...

@shijin-aws
Copy link
Contributor

shijin-aws commented Apr 25, 2023

@shefty I am a bit worried if there is a bug in the rdm_tagged_peek test. If we can have a consistent reproducer I can take a look at it.

@shefty
Copy link
Member

shefty commented Apr 25, 2023

That test must have an issue somewhere, either in the test or its use of common code

@shijin-aws
Copy link
Contributor

That test must have an issue somewhere, either in the test or its use of common code

I will try if I can reproduce with tcp, efa has not hit this issue so far AFAIK

@wzamazon
Copy link
Contributor

@shefty @shijin-aws

Since we believe the fi_rdm_tagged_peek failure is not related to this PR, can we merge it?

@shijin-aws
Copy link
Contributor

Yes, I am looking at #8844 now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants