reported pmDestroyContext: pthread_mutex_destroy(c_lock) failed: Device or resource busy, failed second try as well #118
Comments
|
Similar phenomenon observed in stock pmmgr 3.11.5. Intermittently, during the multithreaded phase where each thread opens a brief pmNewContext to a hypothetical address, we have recently started getting hangs. This seems to occur when a particular target host dies. I happen to have a box that disconnects/reconnects every few minutes on my network, so the error-handling parts of pmmgr/libpcp probably get more of a workout than at a normal site. Anyway, the observed situation is a hang of the pmmgr process:
(gdb) bt
#0 0x00007f29a96ae6bd in pthread_join (threadid=139812543633152,
thread_return=0x0) at pthread_join.c:90
#1 0x00005654f23ae60b in pmmgr_job_spec::parallel_do (this=,
num_threads=,
fn=0x5654f23b0f60 , data=0x7ffd1394b600)
at pmmgr.cxx:653
#2 0x00005654f23b1854 in pmmgr_job_spec::poll (this=0x5654f33b4c70)
at pmmgr.cxx:721
#3 0x00005654f23a4729 in main (argc=, argv=)
at pmmgr.cxx:1791
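(For orientation: the per-thread work described above has roughly the following shape. This is a minimal sketch, not pmmgr's actual code; the hostname is hypothetical and error handling is abbreviated. The point is that parallel_do joins every worker, so a single worker stuck inside libpcp hangs the whole poll.)

    #include <pcp/pmapi.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical sketch of one pmmgr worker: open a short-lived context
     * to a candidate host, do a little work, tear it down. */
    static void *probe_host(void *arg)
    {
        const char *host = arg;               /* e.g. "flaky-box.example.com" */
        int ctx = pmNewContext(PM_CONTEXT_HOST, host);
        if (ctx < 0)
            fprintf(stderr, "pmNewContext(%s): %s\n", host, pmErrStr(ctx));
        else
            pmDestroyContext(ctx);            /* where the c_lock warnings appear */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, probe_host, (void *)"flaky-box.example.com");
        pthread_join(tid, NULL);              /* frame #0 of the hung main thread */
        return 0;
    }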
A few threads are alive; one looks suspicious:
(gdb) info thread
Id Target Id Frame
* 1 Thread 0x7f29a9c848c0 (LWP 17870) "pmmgr" 0x00005654f23ae60b in pmmgr_job_spec::parallel_do (this=, num_threads=,
fn=0x5654f23b0f60 , data=0x7ffd1394b600)
at pmmgr.cxx:653
2 Thread 0x7f28a4ff9700 (LWP 21746) "pmmgr" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
3 Thread 0x7f28a57fa700 (LWP 21747) "pmmgr" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
4 Thread 0x7f28a67fc700 (LWP 21749) "pmmgr" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
5 Thread 0x7f28a7fff700 (LWP 21751) "pmmgr" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
6 Thread 0x7f28a77fe700 (LWP 21752) "pmmgr" 0x00007f29a96b610f in __libc_recv
(fd=fd@entry=11, buf=buf@entry=0x7f29700090ac, n=n@entry=12,
flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f28a77fe700 (LWP 21752))]
#0 0x00007f29a96b610f in __libc_recv (fd=fd@entry=11,
buf=buf@entry=0x7f29700090ac, n=n@entry=12, flags=flags@entry=0)
at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
28 return SYSCALL_CANCEL (recvfrom, fd, buf, n, flags, NULL, NULL);
(gdb) l
23 Returns the number read or -1 for errors. */
24
25 ssize_t
26 __libc_recv (int fd, void *buf, size_t n, int flags)
27 {
28 return SYSCALL_CANCEL (recvfrom, fd, buf, n, flags, NULL, NULL);
29 }
30
31 weak_alias (__libc_recv, __recv)
32 libc_hidden_weak (__recv)
(gdb) p fd
$2 = 11
But what's fd 11?

sudo lsof -p 17870
[...]
pmmgr 17870 pcp 3u IPv4 28309694 0t0 TCP very.local:34448->very.local:pmcd (ESTABLISHED)
pmmgr 17870 pcp 4u IPv4 28305018 0t0 TCP very.local:40486->very.local:pmcd (ESTABLISHED)
pmmgr 17870 pcp 6u IPv4 28303567 0t0 TCP very.elastic.org:42912->dhcp10.elastic.org:pmcd (ESTABLISHED)
pmmgr 17870 pcp 10u sock 0,8 0t0 28313674 protocol: TCP

Whoa, no record of fd 11. It has already been closed, maybe by another thread connecting to the same IP address (via another host name?) and having recovered from the failure. Let's resume.

(gdb) q

% tail -NNN pmmgr.log
[Sat Oct 15 02:12:03] pmmgr(17870) Warning: auxconnect.c:__pmCreateSocket(11): __pmSetSockOpt TCP_NODELAY: Bad file descriptor
[Sat Oct 15 02:12:03] pmmgr(17870) Error: auxconnect.c:__pmConnectTo: cannot set FNDELAY - fcntl(11,F_SETFL,0xffffffff) failed: Bad file descriptor
[Sat Oct 15 02:13:03] pmmgr(17870) Warning: auxconnect.c:__pmCreateSocket(8): __pmSetSockOpt TCP_NODELAY: Bad file descriptor
[Sat Oct 15 02:13:03] pmmgr(17870) Error: auxconnect.c:__pmCreateIPv6Socket: IPV6 is not supported
[Sat Oct 15 02:13:03] pmmgr(17870) Error: auxconnect.c:__pmCreateIPv6Socket: IPV6 is not supported
[Sat Oct 15 02:14:04] pmmgr(17870) Warning: auxconnect.c:__pmCreateSocket(11): __pmSetSockOpt TCP_NODELAY: Bad file descriptor
[Sat Oct 15 02:14:04] pmmgr(17870) Error: __pmGetPDU: fd=11 illegal PDU len=1 in hdr
[Thu Oct 20 15:00:11] pmmgr(17870) Error: __pmGetPDU: fd=11 data read: len=-1: Bad file descriptor
[Thu Oct 20 15:00:11] pmmgr(17870) Error: __pmGetPDU: PDU hdr: len=0x18 type=0x700d from=0x0
[Thu Oct 20 15:00:11] pmmgr(17870) Error: __pmCloseChannelbyContext: fd=11 context=6 expected PDU_PMNS_IDS received: IPC protocol failure
[Thu Oct 20 15:00:24] pmmgr(17870) Error: __pmCloseChannelbyContext: fd=6 context=3 expected PDU_PMNS_IDS received: Timeout waiting for a response from PMCD
pmDestroyContext: pthread_mutex_destroy(c_lock) failed: Device or resource busy
pmDestroyContext: pthread_mutex_destroy(c_lock) failed second try: Device or resource busy

There is the bad c_lock error message again. It seems to me there is a recent regression in the multithreading and/or error-handling code in libpcp, probably associated with sharing sockets between contexts. |
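(Aside: a standalone illustration, independent of PCP, of what that "Device or resource busy" means. Destroying a pthread mutex that is still locked is undefined per POSIX, but glibc detects the case and returns EBUSY, which is exactly the string pmDestroyContext prints for c_lock.)

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        pthread_mutex_lock(&lock);
        int rc = pthread_mutex_destroy(&lock);   /* still held by this thread */
        printf("destroy while locked: %s\n", rc ? strerror(rc) : "ok");

        pthread_mutex_unlock(&lock);
        rc = pthread_mutex_destroy(&lock);        /* now unowned: succeeds */
        printf("destroy after unlock: %s\n", rc ? strerror(rc) : "ok");
        return 0;
    }

On glibc the first destroy reports "Device or resource busy", i.e. some thread still owns (or is waiting on) c_lock at the moment pmDestroyContext tries to destroy it.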
|
Frank, thanks for the detective work. This may be related to the relatively recent "disconnect" logic changes in libpcp that were aimed at cases where pduread() is interrupted, but which potentially implicate any error path from a socket read failure. In your failure case (if it is reproducible), what's the call stack above __libc_recv()? I'm guessing it leads to pduread(), but I'm most interested in who's calling pduread() ... specifically, is it a pmFetch(), a metadata operation, or the handshake during pmNewContext()? Also, the pmmgr.log tail has entries from two days; I'm guessing only the Oct 20 ones are relevant to this problem. And finally, does pmmgr create multiple PCP contexts for the same PMCD host? (That may help me construct a simpler failure scenario outside pmmgr for QA.) |
|
The __libc_recv call stack; I just happen to have it saved in my terminal:
(gdb) bt
#0 0x00007f29a96b610f in __libc_recv (fd=fd@entry=11,
buf=buf@entry=0x7f29700090ac, n=n@entry=12, flags=flags@entry=0)
at ../sysdeps/unix/sysv/linux/x86_64/recv.c:28
#1 0x00007f29a992155e in recv (__flags=0, __n=12, __buf=0x7f29700090ac, __fd=11)
at /usr/include/bits/socket2.h:44
#2 __pmRecv (fd=fd@entry=11, buffer=buffer@entry=0x7f29700090ac,
length=length@entry=12, flags=flags@entry=0) at secureconnect.c:1597
#3 0x00007f29a98dfff2 in pduread (fd=fd@entry=11,
buf=buf@entry=0x7f29700090ac ")\177", len=len@entry=12, part=part@entry=0,
timeout=timeout@entry=10) at pdu.c:212
#4 0x00007f29a98e0c9f in __pmGetPDU (fd=11, mode=mode@entry=0, timeout=10,
result=result@entry=0x7f28a77fd4b8) at pdu.c:495
#5 0x00007f29a98e46ad in pmLookupName (numpmid=numpmid@entry=1,
namelist=namelist@entry=0x7f29700050b0, pmidlist=0x7f29700050d0)
at pmns.c:1703
#6 0x00007f29a990dae2 in bind_expr (n=n@entry=2, np=0x7f297000a1b0)
at derive.c:616
#7 0x00007f29a990d9a8 in bind_expr (n=n@entry=2, np=0x7f2970009020)
at derive.c:578
#8 0x00007f29a990d9a8 in bind_expr (n=n@entry=2, np=0x7f297000a200)
at derive.c:578
#9 0x00007f29a990d9a8 in bind_expr (n=n@entry=2, np=0x7f297000a320)
at derive.c:578
#10 0x00007f29a990ff8f in __dmopencontext (ctxp=ctxp@entry=0x7f2968000b20)
So this is derived-metric-setup related, probably early in pmNewContext(). pmmgr can create multiple contexts to the same host, if discovery / the explicit host list happens to return the equivalent pcp:// URL / hostname / IPs multiple times. |
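(To make the implicated path concrete, here is a minimal sketch of how derived metrics reach pmLookupName during context creation. The metric name and expression are made up for illustration; the real trigger inside pmmgr/pmwebd may differ.)

    #include <pcp/pmapi.h>
    #include <stdio.h>

    int main(void)
    {
        /* Register a (hypothetical) derived metric globally. */
        char *err = pmRegisterDerived("example.idle_ratio",
                                      "kernel.all.cpu.idle / hinv.ncpu");
        if (err != NULL) {
            fprintf(stderr, "pmRegisterDerived error near: %s\n", err);
            return 1;
        }

        /* The expression is bound per-context inside pmNewContext() --
         * the __dmopencontext -> bind_expr -> pmLookupName frames above --
         * so a wedged pmcd connection can stall the caller right here. */
        int ctx = pmNewContext(PM_CONTEXT_HOST, "local:");
        if (ctx < 0) {
            fprintf(stderr, "pmNewContext: %s\n", pmErrStr(ctx));
            return 1;
        }
        pmDestroyContext(ctx);
        return 0;
    }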
|
... appears to improve on the situation. |
|
Frank, so do you suspect that nuking shared sockets to the same pmcd entirely fixes this issue (and can it be repro'd)? Also, is pmwebd supposed to support derived metrics? (AFAIK yes.) But if it isn't, then it should call __pmSetInternalState(PM_STATE_PMCS) fairly early on, so that __dminit() returns immediately and derived metrics are disabled. |
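(A sketch of that suggestion for a daemon that wants derived metrics disabled entirely; whether pmwebd should actually do this is the open question above.)

    #include <pcp/pmapi.h>
    #include <pcp/impl.h>

    int main(int argc, char **argv)
    {
        /* Declare ourselves part of the PMCS so __dminit() is a no-op and
         * pmNewContext() skips derived-metric binding altogether. */
        __pmSetInternalState(PM_STATE_PMCS);

        /* ... normal daemon setup, pmNewContext() calls, etc. ... */
        return 0;
    }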
|
Mark, the issue was always intermittent, so we're working backward from the symptoms (with the c_lock messages etc.), and don't have proof that this was the only cause of the symptoms. It's nice to eliminate the whole possibility though.
In principle, a pmwebd client could ask for derived metrics, so "yes". But I bet that is only one possible path to trigger this problem. |
|
The phenomenon reported re. pmmgr keeps occurring. If not via my socket-unsharing prophylactic patch, do you have any other suggestion for addressing it? |
|
Possibly relevant: intermittent crashes of pmmgr are observed with error messages like:
This message comes from glibc, and is basically an assertion failure. glibc 2.23's release notes describe the situation thusly:
again implicating multithreading & socket handling in libpcp. |
|
The above assertion-failure from glibc re.
still happens with current git-master-level libpcp, i.e. after the pmcd socket unsharing. Luckily, systemd restarts pmmgr right away. |
|
I'd really like to hear if this problem occurs with the current code base (3.12.0 or later) as the locking logic around pmDestroyContext() has dramatically changed and the multiplexing of the client-pmcd socket has gone away. I plan to close this issue at the end of July unless new evidence appears before then. |
|
As per my earlier comment, I believe this is all done (or at worst morphed into something quite different that is better tracked in another issue). |
fche commented Oct 4, 2016
Filing here with incomplete information, so the issue is not forgotten:
@hc000 on IRC reported an instance where, using pcp 3.11.5, possibly two copies of vector talking to the same pmwebd & the same remote pmcd occasionally suffered 1.8s latencies for queries.
pmwebd logs indicate this, running with pmwebd -Dpdu, and maybe also pmcd -t1 to induce timeout errors for those 1.8s requests. Given that this part of pmwebd is strictly single-threaded, this seems to implicate libpcp remote/pdu error handling, and/or the sharing of sockets within libpcp toward the same remote pmcd. It may be worth finally dropping the latter functionality, as a prophylactic.
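(If someone wants to chase those 1.8s outliers outside pmwebd, one crude approach is to time each pmFetch round trip against the same pmcd. A rough sketch; the host and metric below are placeholders, not the reporter's setup.)

    #include <pcp/pmapi.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        char *names[] = { "kernel.all.load" };
        pmID pmids[1];
        pmResult *rp;

        int ctx = pmNewContext(PM_CONTEXT_HOST, "local:");   /* or the remote pmcd */
        if (ctx < 0 || pmLookupName(1, names, pmids) < 0)
            return 1;

        for (int i = 0; i < 100; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            int sts = pmFetch(1, pmids, &rp);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e6;
            if (sts >= 0)
                pmFreeResult(rp);
            if (ms > 1000.0)          /* flag anything suspiciously close to 1.8s */
                printf("fetch %d: %.0f ms (%s)\n", i, ms,
                       sts < 0 ? pmErrStr(sts) : "ok");
        }
        pmDestroyContext(ctx);
        return 0;
    }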