NA SM: address lookup errors: ETOOMANYREFS #385

Closed
shanedsnyder opened this issue Aug 14, 2020 · 5 comments

@shanedsnyder

Describe the bug

When performing many concurrent lookups of other node-local processes using the na+sm transport, you can pretty reliably trigger an error on some of the processes:

# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:2535
 # na_sm_addr_event_send(): sendmsg() failed (Too many references: cannot splice)
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:2279
 # na_sm_addr_lookup_insert_cb(): Could not send addr events
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:2210
 # na_sm_addr_map_insert(): Could not execute insertion callback
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:3434
 # na_sm_addr_lookup(): Could not insert new address
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury_core.c:1218
 # hg_core_addr_lookup(): Could not lookup address na+sm://237456/0 (NA_PROTOCOL_ERROR)
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury_core.c:3850
 # HG_Core_addr_lookup2(): Could not lookup address
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury.c:1489
 # HG_Addr_lookup2(): Could not lookup na+sm://237456/0 (HG_PROTOCOL_ERROR)

Specifically, the sendmsg() call fails with ETOOMANYREFS, which the man page describes as follows:

       ETOOMANYREFS
              This error can occur for sendmsg(2) when sending a file
              descriptor as ancillary data over a UNIX domain socket (see
              the description of SCM_RIGHTS, above).  It occurs if the
              number of "in-flight" file descriptors exceeds the
              RLIMIT_NOFILE resource limit and the caller does not have the
              CAP_SYS_RESOURCE capability.  An in-flight file descriptor is
              one that has been sent using sendmsg(2) but has not yet been
              accepted in the recipient process using recvmsg(2).

              This error is diagnosed since mainline Linux 4.5 (and in some
              earlier kernel versions where the fix has been backported).
              In earlier kernel versions, it was possible to place an
              unlimited number of file descriptors in flight, by sending
              each file descriptor with sendmsg(2) and then closing the file
              descriptor so that it was not accounted against the
              RLIMIT_NOFILE resource limit.
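
For readers unfamiliar with SCM_RIGHTS, here is a minimal sketch of passing a file descriptor over a UNIX domain socket and of where ETOOMANYREFS would surface. The helper names `send_fd` and `recv_fd` are hypothetical and are not taken from na_sm.c; the sketch only uses the standard sendmsg(2)/recvmsg(2) interface on Linux.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send one fd as SCM_RIGHTS ancillary data.
 * Returns 0 on success or the errno value from sendmsg(), e.g.
 * ETOOMANYREFS when too many fds are already in flight. */
static int send_fd(int sock, int fd)
{
    char dummy = 'x'; /* at least one byte of regular data must be sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    if (sendmsg(sock, &msg, 0) < 0)
        return errno;
    return 0;
}

/* Hypothetical helper: receive one fd. Once recvmsg() returns, the fd
 * is no longer "in flight". Returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
        || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A caller could compare the return value of `send_fd()` against ETOOMANYREFS and back off until the receiver has drained some descriptors with `recv_fd()`.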

I am triggering this issue with a test case from the SSG group membership service (https://xgitlab.cels.anl.gov/sds/ssg). The test uses MPI to collectively start a number of na+sm endpoints on a node; each process then concurrently looks up all other group members' addresses using Argobots threads. It looks like this puts too many file descriptors "in flight", exceeding the RLIMIT_NOFILE limit, which appears to be 1024. I'm able to hit this error pretty frequently using 32+ processes on a node.
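
To confirm the limit a process is actually running under, something like the following sketch could be used. The function name `nofile_soft_limit` is made up for illustration; it only wraps the standard getrlimit(2) call.

```c
#include <limits.h>
#include <sys/resource.h>

/* Hypothetical helper: return the soft RLIMIT_NOFILE limit of the
 * calling process, LONG_MAX if it is unlimited, or -1 on error. */
static long nofile_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return LONG_MAX;
    return (long) rl.rlim_cur;
}
```

Printing this value at startup in each MPI rank would show whether the 1024 figure really applies to the processes hitting the error.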

Is this something Mercury could somehow detect and respond to, or does the responsibility fall on users to sensibly limit the number of outstanding lookups?

The behavior I've described is all with the master version of Mercury. I just tried with v2.0.0a1 and don't see the issue, but I also notice that na+sm address lookups clearly take longer in that version, which may effectively spread them out over time and so avoid hitting this maximum.

To Reproduce

You should be able to reproduce this issue using a simple test-case included in the SSG repository. I'm using master versions of Mercury, Argobots, Margo, and SSG. You just need to build SSG and the tests:

make install
make tests

And run the test case like this:

mpirun -n 32 ./tests/ssg-launch-group -s 10 -f gid_file na+sm mpi

soumagne changed the title from "na+sm address lookup errors: ETOOMANYREFS" to "NA SM: address lookup errors: ETOOMANYREFS" on Aug 14, 2020
soumagne added this to the mercury-2.0.0 milestone on Sep 3, 2020

@soumagne
Member

soumagne commented Oct 9, 2020

Note to self, in the kernel code:

if (too_many_unix_fds(current))
        return -ETOOMANYREFS;

This is controlled by RLIMIT_NOFILE. We should definitely keep track of that limit too for debugging purposes, and I expect that bumping the limit would work around the issue.

@shanedsnyder
Author

Sorry, should have commented on this when I originally tried it. But yes, the test case I mention above works completely fine if I bump up the number for RLIMIT_NOFILE.
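
For anyone who wants to apply the same workaround from inside a process rather than through the shell, here is a minimal sketch that raises the soft limit up to the hard limit. The function name `raise_nofile_limit` is hypothetical, and the thread does not say how the limit was bumped in the test above; this is just one possible way, assuming Linux and a finite hard limit.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft RLIMIT_NOFILE limit to the hard
 * limit. No privileges are needed for this as long as the hard limit
 * itself is left unchanged. Returns 0 on success, -1 on failure. */
static int raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == rl.rlim_max)
        return 0; /* already at the maximum allowed */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

This would need to run in every process on the node before the lookups start. It only postpones the error if the number of in-flight descriptors can still grow past the hard limit.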

@soumagne
Member

soumagne commented Oct 9, 2020

ok great thanks

soumagne added a commit to soumagne/mercury that referenced this issue Oct 17, 2020
Track fds that are in use

Clean up progress routines
@soumagne
Member

@shanedsnyder should be fixed now, let me know if you have any problems.

@shanedsnyder
Author

Just tried this out and can confirm this error no longer occurs using mercury master. Thanks for the fix, @soumagne!
