NA CCI: Added error handling when cci_connect() fails #206

Closed


@marcvef (Contributor) commented Jan 29, 2018

When na_cci_addr_lookup() is eventually called from HG_Addr_lookup(), cci_connect() can return an error that is currently not handled. As a result, clients receive HG_SUCCESS from HG_Addr_lookup() even though the lookup has failed. This patch returns an appropriate error in two cases: (1) cci_connect() timed out, or (2) any other error occurred.
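To illustrate the idea, here is a minimal, self-contained sketch of propagating a failed connect instead of reporting success. It is not the actual patch: every `sketch_`/`SKETCH_` name is a hypothetical stand-in (the real status codes come from the CCI and Mercury NA headers), and the connect call is replaced by a stub that returns an injected status.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CCI status codes (real ones are defined by CCI). */
enum sketch_cci_status {
    SKETCH_CCI_SUCCESS = 0,
    SKETCH_CCI_ETIMEDOUT,
    SKETCH_CCI_ERROR
};

/* Hypothetical stand-ins for NA return codes (real ones are defined by Mercury). */
enum sketch_na_return {
    SKETCH_NA_SUCCESS = 0,
    SKETCH_NA_TIMEOUT,
    SKETCH_NA_PROTOCOL_ERROR
};

/* Map the connect status to a return code, distinguishing the two cases:
 * (1) the connect timed out, (2) any other error. */
static enum sketch_na_return
sketch_map_connect_status(enum sketch_cci_status rc)
{
    switch (rc) {
        case SKETCH_CCI_SUCCESS:
            return SKETCH_NA_SUCCESS;
        case SKETCH_CCI_ETIMEDOUT:
            return SKETCH_NA_TIMEOUT;
        default:
            return SKETCH_NA_PROTOCOL_ERROR;
    }
}

/* Stub standing in for the connect call: returns the status the caller injects. */
static enum sketch_cci_status
sketch_connect(const char *uri, enum sketch_cci_status injected)
{
    assert(uri != NULL);
    return injected;
}

/* Lookup sketch: the connect status is propagated rather than only logged. */
static enum sketch_na_return
sketch_addr_lookup(const char *uri, enum sketch_cci_status injected)
{
    enum sketch_cci_status rc = sketch_connect(uri, injected);
    return sketch_map_connect_status(rc);
}
```

With this shape, a caller that checks the return value of the lookup sees the timeout instead of a success code.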

We ran into this situation while using CCI with the verbs plugin: many clients called HG_Addr_lookup() on the same servers, overwhelming them so that they could not respond in time. The following error message was produced on the client side, while HG_Addr_lookup() still returned HG_SUCCESS:

# na_cci_addr_lookup(): cci_connect(verbs://a0357-ib:4433) failed with CCI_ETIMEDOUT
# NA -- Error -- /some/path/mercury/src/na/na_cci.c:833

Note that cci_connect() can be given an optional timeout value, which is currently not used in the cci_connect() call: na_cci_addr_lookup() passes NULL, i.e., wait forever according to the documentation.
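As a rough sketch of how a finite timeout could be prepared, assuming the timeout is passed as a pointer to a `struct timeval` and that NULL means wait forever: the helper name `sketch_connect_timeout` and the millisecond parameter are hypothetical and not part of CCI or Mercury.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: convert a timeout in milliseconds into the pointer
 * that would be handed to the connect call. A non-positive value yields
 * NULL, which is assumed to mean "wait forever". */
static const struct timeval *
sketch_connect_timeout(long timeout_ms, struct timeval *storage)
{
    if (timeout_ms <= 0)
        return NULL;

    /* A finite timeout needs caller-provided storage. */
    assert(storage != NULL);
    storage->tv_sec = timeout_ms / 1000;
    storage->tv_usec = (timeout_ms % 1000) * 1000;
    return storage;
}
```

Keeping the storage on the caller's side avoids returning a pointer to a local variable and lets the same helper express both the NULL case and a bounded wait.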

@soumagne (Member)

Interesting, thanks! I'll go through it and merge it.

@soumagne soumagne added this to the mercury-1.0.0 milestone Oct 2, 2018
@soumagne soumagne closed this in 1f6a92e Oct 25, 2018
range3 pushed a commit to range3/GekkoFS that referenced this pull request Apr 27, 2021
Also removed mercury_cci_verbs_lookup patch, because it has been merged upstream [1]

[1]: mercury-hpc/mercury#206