reported pmDestroyContext: pthread_mutex_destroy(c_lock) failed: Device or resource busy, failed second try as well #118

Closed
fche opened this Issue Oct 4, 2016 · 11 comments


fche commented Oct 4, 2016 edited

Filing here with incomplete information, so the issue is not forgotten:

@hc000 on irc reported an instance where, on pcp 3.11.5, possibly two copies of vector talking to the same pmwebd and the same remote pmcd were occasionally suffering 1.8s latencies for queries.

The pmwebd logs (running with pmwebd -Dpdu, and maybe also pmcd -t1 to induce timeout errors for those 1.8s requests) show:

pmwebd(14086): context web103761=pm2) expired. pmDestroyContext: 
pthread_mutex_destroy(c_lock) failed: Device or resource busy, failed second try as well

Given that this part of pmwebd is strictly single-threaded, this seems to implicate libpcp remote/pdu error handling, and/or the sharing of sockets within libpcp toward the same remote pmcd. It may be worth finally dropping the latter functionality, as a prophylactic.
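
For reference, "Device or resource busy" (EBUSY) is what glibc's pthread_mutex_destroy() returns when the mutex is still locked or otherwise in use, which is what the pmDestroyContext() message above is reporting for the context's c_lock. A minimal stand-alone illustration (not PCP code; POSIX leaves destroying a locked mutex undefined, but glibc reports EBUSY):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_lock(&lock);               /* someone still holds the lock ... */
    int sts = pthread_mutex_destroy(&lock);  /* ... so destroy fails */
    if (sts != 0)
        printf("pthread_mutex_destroy failed: %s\n", strerror(sts));  /* EBUSY */

    pthread_mutex_unlock(&lock);             /* once released, destroy succeeds */
    return pthread_mutex_destroy(&lock);
}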


fche commented Oct 20, 2016

A similar phenomenon has been observed in stock pmmgr 3.11.5. Intermittently, during the multithreaded phase where each thread briefly opens a pmNewContext to a hypothetical address, we have recently started getting hangs. This seems to occur when a particular target host dies. I happen to have a box on my network that disconnects/reconnects every few minutes, so the error-handling parts of pmmgr/libpcp probably get more of a workout than at a normal site.
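
For context, the phase in question looks roughly like the following sketch; this is a hypothetical reduction, not the actual pmmgr code, and the host arguments are placeholders:

/* cc -o probe probe.c -lpcp -lpthread */
#include <pcp/pmapi.h>
#include <pthread.h>
#include <stdio.h>

static void *probe_host(void *arg)
{
    const char *host = arg;
    int ctx = pmNewContext(PM_CONTEXT_HOST, host);   /* connect + handshake */
    if (ctx < 0)
        fprintf(stderr, "pmNewContext(%s): %s\n", host, pmErrStr(ctx));
    else
        pmDestroyContext(ctx);     /* where the c_lock EBUSY message shows up */
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid[64];
    int i, n = (argc - 1 < 64) ? argc - 1 : 64;

    for (i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, probe_host, argv[i + 1]);
    for (i = 0; i < n; i++)
        pthread_join(&tid[i], NULL);   /* the hang below is in a join like this */
    return 0;
}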

Anyway, the observed situation is a hang of the pmmgr process:

(gdb) bt
#0  0x00007f29a96ae6bd in pthread_join (threadid=139812543633152, 
    thread_return=0x0) at pthread_join.c:90
#1  0x00005654f23ae60b in pmmgr_job_spec::parallel_do (this=, 
    num_threads=, 
    fn=0x5654f23b0f60 , data=0x7ffd1394b600)
    at pmmgr.cxx:653
#2  0x00005654f23b1854 in pmmgr_job_spec::poll (this=0x5654f33b4c70)
    at pmmgr.cxx:721
#3  0x00005654f23a4729 in main (argc=, argv=)
    at pmmgr.cxx:1791

A few threads are alive; one looks suspicious:

(gdb) info thread
  Id   Target Id         Frame 
* 1    Thread 0x7f29a9c848c0 (LWP 17870) "pmmgr" 0x00005654f23ae60b in pmmgr_job_spec::parallel_do (this=, num_threads=, 
    fn=0x5654f23b0f60 , data=0x7ffd1394b600)
    at pmmgr.cxx:653
  2    Thread 0x7f28a4ff9700 (LWP 21746) "pmmgr" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  3    Thread 0x7f28a57fa700 (LWP 21747) "pmmgr" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  4    Thread 0x7f28a67fc700 (LWP 21749) "pmmgr" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  5    Thread 0x7f28a7fff700 (LWP 21751) "pmmgr" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  6    Thread 0x7f28a77fe700 (LWP 21752) "pmmgr" 0x00007f29a96b610f in __libc_recv
    (fd=fd@entry=11, buf=buf@entry=0x7f29700090ac, n=n@entry=12, 
    flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f28a77fe700 (LWP 21752))]
#0  0x00007f29a96b610f in __libc_recv (fd=fd@entry=11, 
    buf=buf@entry=0x7f29700090ac, n=n@entry=12, flags=flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
28    return SYSCALL_CANCEL (recvfrom, fd, buf, n, flags, NULL, NULL);
(gdb) l
23     Returns the number read or -1 for errors.  */
24  
25  ssize_t
26  __libc_recv (int fd, void *buf, size_t n, int flags)
27  {
28    return SYSCALL_CANCEL (recvfrom, fd, buf, n, flags, NULL, NULL);
29  }
30  
31  weak_alias (__libc_recv, __recv)
32  libc_hidden_weak (__recv)
(gdb) p fd
$2 = 11

But what's fd 11?

sudo lsof -p 17870
[...]
pmmgr   17870  pcp    3u  IPv4 28309694      0t0      TCP very.local:34448->very.local:pmcd (ESTABLISHED)
pmmgr   17870  pcp    4u  IPv4 28305018      0t0      TCP very.local:40486->very.local:pmcd (ESTABLISHED)
pmmgr   17870  pcp    6u  IPv4 28303567      0t0      TCP very.elastic.org:42912->dhcp10.elastic.org:pmcd (ESTABLISHED)
pmmgr   17870  pcp   10u  sock      0,8      0t0 28313674 protocol: TCP

Whoa, no record of fd 11. It has already been closed, maybe by another thread that connected to the same IP address (via another host name?) and recovered from the failure. Let's resume.
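
(A hypothetical cross-check, not part of the session above: from inside the process, fcntl(fd, F_GETFD) returns -1 with errno EBADF once a descriptor has been closed, consistent with fd 11 being absent from the lsof output.)

#include <fcntl.h>
#include <stdio.h>

/* non-zero while fd is still open; fcntl() gives -1/EBADF once it is closed */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

int main(void)
{
    printf("fd 11 open? %s\n", fd_is_open(11) ? "yes" : "no");
    return 0;
}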

(gdb) q
% tail -NNN pmmgr.log
[Sat Oct 15 02:12:03] pmmgr(17870) Warning: auxconnect.c:__pmCreateSocket(11): __pmSetSockOpt TCP_NODELAY: Bad file descriptor
[Sat Oct 15 02:12:03] pmmgr(17870) Error: auxconnect.c:__pmConnectTo: cannot set FNDELAY - fcntl(11,F_SETFL,0xffffffff) failed: Bad file descriptor
[Sat Oct 15 02:13:03] pmmgr(17870) Warning: auxconnect.c:__pmCreateSocket(8): __pmSetSockOpt TCP_NODELAY: Bad file descriptor
[Sat Oct 15 02:13:03] pmmgr(17870) Error: auxconnect.c:__pmCreateIPv6Socket: IPV6 is not supported
[Sat Oct 15 02:13:03] pmmgr(17870) Error: auxconnect.c:__pmCreateIPv6Socket: IPV6 is not supported
[Sat Oct 15 02:14:04] pmmgr(17870) Warning: auxconnect.c:__pmCreateSocket(11): __pmSetSockOpt TCP_NODELAY: Bad file descriptor
[Sat Oct 15 02:14:04] pmmgr(17870) Error: __pmGetPDU: fd=11 illegal PDU len=1 in hdr
[Thu Oct 20 15:00:11] pmmgr(17870) Error: __pmGetPDU: fd=11 data read: len=-1: Bad file descriptor
[Thu Oct 20 15:00:11] pmmgr(17870) Error: __pmGetPDU: PDU hdr: len=0x18 type=0x700d from=0x0
[Thu Oct 20 15:00:11] pmmgr(17870) Error: __pmCloseChannelbyContext: fd=11 context=6 expected PDU_PMNS_IDS received: IPC protocol failure
[Thu Oct 20 15:00:24] pmmgr(17870) Error: __pmCloseChannelbyContext: fd=6 context=3 expected PDU_PMNS_IDS received: Timeout waiting for a response from PMCD
pmDestroyContext: pthread_mutex_destroy(c_lock) failed: Device or resource busy
pmDestroyContext: pthread_mutex_destroy(c_lock) failed second try: Device or resource busy

There is the bad c_lock error message again.

ISTM there is a recent regression in the multithreading and/or error-handling code in libpcp, probably associated with sharing sockets between contexts.


kmcdonell commented Oct 20, 2016

Frank, thanks for the detective work.

This may be related to the relatively recent "disconnect" logic changes in libpcp that were aimed at cases where pduread() is interrupted, but they potentially implicate any error path from a socket read failure.

In your failure case (if it is reproducible), what's the callstack above __libc_recv() ... I'm guessing it leads to pduread() but I'm most interested in who's calling pduread() ... specifically is it a pmFetch() or a metadata operation or the handshake during pmNewContext()?

Also, the pmmgr.log tail has entries from two days; I'm guessing only the Oct 20 ones are relevant to this problem.

And finally, does pmmgr create multiple PCP contexts for the same PMCD host? (That may help me construct a simpler failure scenario outside pmmgr for QA.)


fche commented Oct 20, 2016

The __libc_recv call stack; I just happen to have it saved in my terminal:

(gdb) bt
#0  0x00007f29a96b610f in __libc_recv (fd=fd@entry=11, 
    buf=buf@entry=0x7f29700090ac, n=n@entry=12, flags=flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
#1  0x00007f29a992155e in recv (__flags=0, __n=12, __buf=0x7f29700090ac, __fd=11)
    at /usr/include/bits/socket2.h:44
#2  __pmRecv (fd=fd@entry=11, buffer=buffer@entry=0x7f29700090ac, 
    length=length@entry=12, flags=flags@entry=0) at secureconnect.c:1597
#3  0x00007f29a98dfff2 in pduread (fd=fd@entry=11, 
    buf=buf@entry=0x7f29700090ac ")\177", len=len@entry=12, part=part@entry=0, 
    timeout=timeout@entry=10) at pdu.c:212
#4  0x00007f29a98e0c9f in __pmGetPDU (fd=11, mode=mode@entry=0, timeout=10, 
    result=result@entry=0x7f28a77fd4b8) at pdu.c:495
#5  0x00007f29a98e46ad in pmLookupName (numpmid=numpmid@entry=1, 
    namelist=namelist@entry=0x7f29700050b0, pmidlist=0x7f29700050d0)
    at pmns.c:1703
#6  0x00007f29a990dae2 in bind_expr (n=n@entry=2, np=0x7f297000a1b0)
    at derive.c:616
#7  0x00007f29a990d9a8 in bind_expr (n=n@entry=2, np=0x7f2970009020)
    at derive.c:578
#8  0x00007f29a990d9a8 in bind_expr (n=n@entry=2, np=0x7f297000a200)
    at derive.c:578
#9  0x00007f29a990d9a8 in bind_expr (n=n@entry=2, np=0x7f297000a320)
    at derive.c:578
#10 0x00007f29a990ff8f in __dmopencontext (ctxp=ctxp@entry=0x7f2968000b20)

So it's derived-metric setup related, probably early in pmNewContext.
The hang must have started Oct 15, so those entries are probably relevant too. I believe at least one prior instance of this occurred at the end of September.
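
To make the implicated path concrete, here is a hedged sketch (the metric name and expression are illustrative only, not what pmmgr registers): a derived metric registered before pmNewContext() is bound during context setup, and that binding is the pmLookupName() PDU exchange seen stuck in __libc_recv above.

#include <pcp/pmapi.h>
#include <stdio.h>

int main(void)
{
    char *err = pmRegisterDerived("example.cpu.busy",
                                  "kernel.all.cpu.user + kernel.all.cpu.sys");
    if (err != NULL) {
        fprintf(stderr, "pmRegisterDerived: %s @ %s\n", pmDerivedErrStr(), err);
        return 1;
    }

    /* __dmopencontext() -> bind_expr() -> pmLookupName() happens in here */
    int ctx = pmNewContext(PM_CONTEXT_HOST, "local:");
    if (ctx < 0) {
        fprintf(stderr, "pmNewContext: %s\n", pmErrStr(ctx));
        return 1;
    }
    pmDestroyContext(ctx);
    return 0;
}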

pmmgr can create multiple contexts to the same host, if discovery / the explicit list happens to return the equivalent pcp:// URL / hostname / IPs multiple times.


goodwinos commented Nov 2, 2016

Frank, so do you suspect that nuking shared sockets to the same pmcd entirely fixes this issue (can it be repro'd)?

Also, is pmwebd supposed to support derived metrics? (afaik yes). But if it doesn't then it should call __pmSetInternalState(PM_STATE_PMCS) fairly early on so __dminit() will just return immediately and derived metrics will be disabled.
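
A sketch of that suggestion, assuming the libpcp internal headers of this era (<pcp/impl.h> is where __pmSetInternalState and PM_STATE_PMCS are declared):

#include <pcp/pmapi.h>
#include <pcp/impl.h>

int main(void)
{
    /* as early as possible, before any pmNewContext() ... */
    __pmSetInternalState(PM_STATE_PMCS);

    /* ... contexts created from here on skip derived-metric binding,
     * because __dminit() returns immediately in the PMCS state */
    int ctx = pmNewContext(PM_CONTEXT_HOST, "local:");
    if (ctx >= 0)
        pmDestroyContext(ctx);
    return 0;
}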


fche commented Nov 2, 2016

Mark, the issue was always intermittent, so we're working backward from the symptoms (with the c_lock messages etc.), and don't have proof that this was the only cause of the symptoms. It's nice to eliminate the whole possibility though.

Also, is pmwebd supposed to support derived metrics?

In principle, a pmwebd client could ask for derived metrics, so "yes". But I bet that is only one possible path to trigger this problem.


fche commented Nov 17, 2016

The phenomenon reported above re. pmmgr keeps occurring. If not my socket-unsharing prophylactic patch, do you have any other suggestion?


fche commented Nov 17, 2016

Possibly relevant: intermittent crashes of pmmgr are observed with error messages like:

Unexpected error 9 on netlink descriptor 7

This message comes from glibc, and is basically an assertion failure. glibc 2.23's release notes describe the situation thusly:

getaddrinfo now detects certain invalid responses on an internal netlink
socket. If such responses are received, an affected process will
terminate with an error message of "Unexpected error <number> on netlink
descriptor <number>" or "Unexpected netlink response of size <number> on
descriptor <number>". The most likely cause for these errors is a
multi-threaded application which erroneously closes and reuses the netlink
file descriptor while it is used by getaddrinfo.

again implicating multithreading & socket handling in libpcp.
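
A contrived reduction of the class of bug glibc is describing (hypothetical, not libpcp code): one thread closes a descriptor number it does not own while another thread's getaddrinfo() may still be using that number for its internal netlink socket.

#include <netdb.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void *resolver(void *arg)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo((const char *)arg, "44321", &hints, &res) == 0)  /* 44321 = pmcd */
        freeaddrinfo(res);
    return NULL;
}

static void *rogue_closer(void *arg)
{
    int fd = *(int *)arg;
    close(fd);      /* BUG: closing a descriptor this thread does not own ...  */
    close(fd);      /* ... and closing it again after the number may be reused */
    return NULL;
}

int main(void)
{
    int scratch = dup(0);          /* stand-in for a stale libpcp socket fd */
    pthread_t a, b;

    pthread_create(&a, NULL, resolver, "dhcp10.elastic.org");
    pthread_create(&b, NULL, rogue_closer, &scratch);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}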


fche commented Jun 11, 2017

The above assertion failure from glibc, i.e.

Unexpected error 9 on netlink descriptor 7

still happens with current git-master libpcp, i.e. even after the pmcd socket unsharing. Luckily, systemd restarts pmmgr right away.


kmcdonell commented Jul 4, 2017

I'd really like to hear whether this problem occurs with the current code base (3.12.0 or later), as the locking logic around pmDestroyContext() has changed dramatically and the multiplexing of the client-pmcd socket has gone away.

I plan to close this issue at the end of July unless new evidence appears before then.


kmcdonell commented Aug 3, 2017

As per my earlier comment, I believe this is all done (or at worst has morphed into something quite different that is better tracked in another issue).

kmcdonell closed this Aug 3, 2017
