Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update providers to report source addresses #2618

Closed
2 of 11 tasks
shefty opened this issue Dec 23, 2016 · 13 comments
Closed
2 of 11 tasks

Update providers to report source addresses #2618

shefty opened this issue Dec 23, 2016 · 13 comments
Labels
Milestone

Comments

@shefty
Copy link
Member

shefty commented Dec 23, 2016

Providers that will support ABI 1.5 need to be updated to support 461306f.

  • bgq
  • gni
  • mlx
  • psm
  • psm2
  • rxd
  • rxm
  • sockets
  • udp
  • usnic
  • verbs
@mblockso
Copy link
Contributor

@pkcoff

@shefty shefty added the bug label Jan 12, 2017
j-xiong added a commit to j-xiong/libfabric that referenced this issue Jan 13, 2017
See ofiwg#2618.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
@hppritcha
Copy link
Contributor

hppritcha commented Jan 19, 2017

We're looking at how to implement this functionality in the GNI provider and have hit a problem. Namely the GNI address can't fit into the 64 bits of the err_data field, which means we'd have to allocate memory somewhere else and have this field point to that memory, but then how would we free it? The way the psm2 provider is handling it, it looks like there can only have one outstanding error CQE for this type of error condition since the provider squirrels the error data into a field on the psmx2_fid_cq.

@shefty
Copy link
Member Author

shefty commented Jan 19, 2017

There are a couple ways to handle this, but this is a general work-able flow:

  1. Each CQ has a single err_data pointer.
  2. When writing an error to the CQ, allocate a data buffer for the error data. The CQ entry should somehow point to the error data buffer.
  3. Return EAVAIL when the app attempts to read the error entry using cq_read
  4. On cq_readfrom, if cq.err_data is not NULL, free it. Set the cq.err_data pointer to the entry's error data buffer. Remove the entry from the CQ.

@hppritcha
Copy link
Contributor

I don't see how this would work reliably for FI_THREAD_SAFE thread safety mode.

@shefty
Copy link
Member Author

shefty commented Jan 20, 2017 via email

@pkcoff
Copy link
Contributor

pkcoff commented Feb 14, 2017

On the call today Sean mentioned BGQ needs to sign off on this, however I don't think this applies to us since we've implemented FI_DIRECTED_RECV and we are sending the source address in the packet header and not looking it up in the address vector

@shefty
Copy link
Member Author

shefty commented Feb 14, 2017 via email

sayantansur pushed a commit to sayantansur/libfabric that referenced this issue Feb 15, 2017
See ofiwg#2618.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
@mblockso
Copy link
Contributor

What Paul is saying is that the bgq receive protocol does not reference/access the AV at all. The source address is transferred with the packet metadata from origin to target. If we need to error-check that source address then that would introduce an array lookup and likely a cache miss.

On bgq a user would have to try and explicitly mess this up (mismatched source address and/or missing address vector element) and this will not happen in practice.

@shefty @pkcoff

@shefty
Copy link
Member Author

shefty commented Mar 14, 2017

For the original issue, I have no objection to a provider just clicking 'done'. If this means that we document that the provider requires that all addresses be inserted into the AV up front, that's fine. I agree that we don't want to force a lookup just to make this check.

We can discuss if the app should opt into this using some other mechanism that just FI_SOURCE + API version.

@shefty
Copy link
Member Author

shefty commented Mar 14, 2017

Based on ofiwg pie day discussions, further changes will be proposed.

@pkcoff
Copy link
Contributor

pkcoff commented Mar 14, 2017

After the call today just wanted to further explain -- right now we have just 1 user/app on bgq which is mpich - for all ranks the provider is computing this av in the same way and storing it locally at mpi_init, subsequently any communication initiated by mpich say for send-recv in FI_DIRECTED_RECV mode the src_addr fi_addr_t struct will have contents that mpich explicitly got from the bgq provider so the scenario for a bogus source address shouldn't exist --- and we certainly want an option to NOT have to check for this to avoid the overhead

@bturrubiates
Copy link
Member

After the call today just wanted to further explain -- right now we have just 1 user/app on bgq which is mpich - for all ranks the provider is computing this av in the same way and storing it locally at mpi_init, subsequently any communication initiated by mpich say for send-recv in FI_DIRECTED_RECV mode the src_addr fi_addr_t struct will have contents that mpich explicitly got from the bgq provider so the scenario for a bogus source address shouldn't exist --- and we certainly want an option to NOT have to check for this to avoid the overhead

I don't understand this comment, I think I missed something. I thought this issue only applied to fi_cq_readfrom, why are you mentioning send/recv?

Does the bgq provider always use fi_cq_readfrom?

@pkcoff
Copy link
Contributor

pkcoff commented Mar 15, 2017

No, i guess i thought this was for fi_cq_read also. Anyhow, bgq cq's contain fi_bgq_context's which also have fi_addr_t src addrs, and these src addrs are guaranteed to be valid and already in the av and this error condition cannot occur and I just don't want to have any overhead from this mechanism to report unknown source data since that can't happen on bgq.

@shefty shefty closed this as completed Aug 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants