NA SM: address lookup errors: ETOOMANYREFS #385
Comments
Note to self, in the kernel code:
this is controlled by
Sorry, I should have commented on this when I originally tried it. But yes, the test case I mention above works completely fine if I bump up the limit for RLIMIT_NOFILE.
ok great thanks
Track fds that are in use
Clean up progress routines
@shanedsnyder should be fixed now, let me know if you have any problems.
Just tried this out and can confirm this error no longer occurs using mercury master. Thanks for the fix @soumagne!
Describe the bug
When performing many concurrent lookups of other node-local processes using the na+sm transport, you can pretty reliably trigger an error on some of the processes:
Specifically, the error returned from the sendmsg() call is ETOOMANYREFS, which I found the following notes on in the man page:
I am triggering this issue using an SSG group membership service (https://xgitlab.cels.anl.gov/sds/ssg) test case, which uses MPI to collectively start a bunch of na+sm endpoints on a node; each process then concurrently looks up all other group members' addresses using Argobots threads. It looks like this is causing too many "in flight" file descriptors, exceeding the RLIMIT_NOFILE maximum, which appears to be 1024. I'm able to hit this error pretty frequently using 32+ processes on a node.
Is this something Mercury could somehow detect and respond to, or does that responsibility fall on the users to sensibly limit the number of outstanding lookups?
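To make the failure mode concrete, here is a rough sketch of passing a file descriptor over a UNIX domain socket as SCM_RIGHTS ancillary data and telling ETOOMANYREFS apart from other sendmsg() errors. It assumes that this is how the descriptors end up "in flight"; `send_fd` and `recv_fd` are hypothetical helpers written for illustration, not functions from the Mercury source.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one descriptor as SCM_RIGHTS ancillary data.
 * Returns 0 on success, 1 if sendmsg() failed with ETOOMANYREFS (the caller
 * could back off and retry once the peer has received pending descriptors),
 * and -1 on any other error. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of regular data is sent with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    if (sendmsg(sock, &msg, 0) < 0)
        return (errno == ETOOMANYREFS) ? 1 : -1;
    return 0;
}

/* Hypothetical helper: receive one descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A descriptor counts as in flight from the sendmsg() until the peer's recvmsg(), so a sender that sees the `1` return could wait for receivers to catch up instead of treating the lookup as failed. Whether that retry belongs in the transport or in the caller is exactly the open question here.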
The behavior I've described is all with the master version of Mercury. I just tried with v2.0.0a1 and don't see the issue, but I also notice that the na+sm address lookups are clearly taking longer in that version, which may effectively spread them out over time and so avoid hitting this maximum.
To Reproduce
You should be able to reproduce this issue using a simple test case included in the SSG repository. I'm using master versions of Mercury, Argobots, Margo, and SSG. You just need to build SSG and the tests:
And run the test case like this: