NA SM: address lookup errors: ETOOMANYREFS #385

shanedsnyder · 2020-08-14T17:56:23Z

Describe the bug

When performing many concurrent lookups of other node-local processes using the na+sm transport, you can pretty reliably trigger an error on some of the processes:

# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:2535
 # na_sm_addr_event_send(): sendmsg() failed (Too many references: cannot splice)
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:2279
 # na_sm_addr_lookup_insert_cb(): Could not send addr events
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:2210
 # na_sm_addr_map_insert(): Could not execute insertion callback
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/na/na_sm.c:3434
 # na_sm_addr_lookup(): Could not insert new address
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury_core.c:1218
 # hg_core_addr_lookup(): Could not lookup address na+sm://237456/0 (NA_PROTOCOL_ERROR)
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury_core.c:3850
 # HG_Core_addr_lookup2(): Could not lookup address
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury.c:1489
 # HG_Addr_lookup2(): Could not lookup na+sm://237456/0 (HG_PROTOCOL_ERROR)

Specifically, the error returned from the sendmsg() call is ETOOMANYREFS, which the man page describes as follows:

       ETOOMANYREFS
              This error can occur for sendmsg(2) when sending a file
              descriptor as ancillary data over a UNIX domain socket (see
              the description of SCM_RIGHTS, above).  It occurs if the
              number of "in-flight" file descriptors exceeds the
              RLIMIT_NOFILE resource limit and the caller does not have the
              CAP_SYS_RESOURCE capability.  An in-flight file descriptor is
              one that has been sent using sendmsg(2) but has not yet been
              accepted in the recipient process using recvmsg(2).

              This error is diagnosed since mainline Linux 4.5 (and in some
              earlier kernel versions where the fix has been backported).
              In earlier kernel versions, it was possible to place an
              unlimited number of file descriptors in flight, by sending
              each file descriptor with sendmsg(2) and then closing the file
              descriptor so that it was not accounted against the
              RLIMIT_NOFILE resource limit.
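
For readers unfamiliar with SCM_RIGHTS, here is a minimal sketch of the mechanism the man page describes: passing a file descriptor as ancillary data over a UNIX domain socket. This is not the code from na_sm.c; the helper names send_fd() and recv_fd() are made up for illustration. A descriptor counts as "in flight" from the moment sendmsg() succeeds until the peer picks it up with recvmsg().

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a connected UNIX domain socket.
 * Returns 0 on success, -1 on failure (errno is set by sendmsg(), e.g.
 * ETOOMANYREFS when too many descriptors are already in flight). */
int send_fd(int sock, int fd)
{
    char byte = 0; /* one byte of regular data travels with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    /* On success the fd is in flight until the peer calls recvmsg(). */
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    assert(sock >= 0);

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
        || cmsg->cmsg_type != SCM_RIGHTS
        || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

If many senders call something like send_fd() before any receiver gets around to recv_fd(), the in-flight count grows, which is the situation the ETOOMANYREFS check guards against.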

I am triggering this issue using an SSG group membership service (https://xgitlab.cels.anl.gov/sds/ssg) test case, which uses MPI to collectively start a number of na+sm endpoints on a node; each process then concurrently looks up all other group members' addresses using Argobots threads. It looks like this puts too many file descriptors "in flight", exceeding the RLIMIT_NOFILE limit, which appears to be 1024 here. I'm able to hit this error pretty frequently using 32+ processes on a node.

Is this something Mercury could somehow detect and respond to, or does that responsibility fall on users to sensibly limit the number of outstanding lookups?

The behavior I've described is all with the master version of Mercury. I just tried with v2.0.0a1 and don't see the issue, but I also notice that na+sm address lookups clearly take longer in that version, which may effectively spread them out over time and so avoid hitting this maximum.

To Reproduce

You should be able to reproduce this issue using a simple test-case included in the SSG repository. I'm using master versions of Mercury, Argobots, Margo, and SSG. You just need to build SSG and the tests:

make install
make tests

And run the test case like this:

mpirun -n 32 ./tests/ssg-launch-group -s 10 -f gid_file na+sm mpi


soumagne · 2020-10-09T06:09:01Z

Note to self, in the kernel code:

    if (too_many_unix_fds(current))
        return -ETOOMANYREFS;

this is controlled by RLIMIT_NOFILE. We should definitely keep track of that limit too for debugging purposes, and I expect that bumping the limit would work around that issue.

shanedsnyder · 2020-10-09T18:34:01Z

Sorry, should have commented on this when I originally tried it. But yes, the test case I mention above works completely fine if I bump up the number for RLIMIT_NOFILE.
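
For anyone wanting to apply the same workaround from inside a program rather than from the shell, here is a rough sketch using getrlimit()/setrlimit(). It is not part of Mercury; raise_nofile_limit() is a made-up helper name, and it only raises the soft limit, since an unprivileged process cannot go past the hard limit.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: try to raise the soft RLIMIT_NOFILE to `wanted`,
 * capped at the hard limit. Returns the soft limit in effect afterwards,
 * or 0 if the limit could not be queried. */
rlim_t raise_nofile_limit(rlim_t wanted)
{
    struct rlimit rl;
    rlim_t target;

    assert(wanted > 0);

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;

    /* Never ask for more than the hard limit. */
    target = wanted < rl.rlim_max ? wanted : rl.rlim_max;

    if (target > rl.rlim_cur) {
        rl.rlim_cur = target;
        /* Best effort: if this fails, the old limit stays in place. */
        (void) setrlimit(RLIMIT_NOFILE, &rl);
    }

    /* Re-query so the caller sees what is actually in effect. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;

    return rl.rlim_cur;
}
```

Calling something like raise_nofile_limit(4096) early in main() would be one way to use it. From a shell, `ulimit -n` adjusts the same soft limit for processes launched afterwards.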

soumagne · 2020-10-09T21:18:16Z

ok great thanks


soumagne · 2020-10-17T03:05:20Z

@shanedsnyder should be fixed now, let me know if you have any problems.

shanedsnyder · 2020-10-19T21:40:45Z

Just tried this out and can confirm this error no longer occurs using mercury master. Thanks for the fix @soumagne !

soumagne changed the title ~~na+sm address lookup errors: ETOOMANYREFS~~ NA SM: address lookup errors: ETOOMANYREFS Aug 14, 2020

soumagne added bug major na labels Aug 14, 2020

soumagne added this to the mercury-2.0.0 milestone Sep 3, 2020

soumagne added a commit to soumagne/mercury that referenced this issue Oct 17, 2020

NA SM: defer send of events to send operations (fix mercury-hpc#385)

e6f5d26

Track fds that are in use
Clean up progress routines

soumagne closed this as completed in 27818ab Oct 17, 2020

shanedsnyder mentioned this issue Mar 18, 2021

unable to launch 48 members in a group mochi-hpc/mochi-ssg#21

Open
