Fatal errors sometimes fail to halt the netdata daemon #5896

Closed
mfundul opened this issue Apr 19, 2019 · 9 comments
Labels
area/daemon, bug, priority/low (a 'nice to have', not a critical issue)

Comments

@mfundul
Contributor

mfundul commented Apr 19, 2019

Bug report summary

Fatal errors sometimes fail to halt the netdata daemon.

OS / Environment

Arch Linux

Netdata version (output of netdata -V)

PR #5282

Component Name

daemon

Steps To Reproduce

Enable the new DB engine.

Expected behavior

It should kill the programme and print the error message.

Backtrace
#0  0x00007ffff77fdbef in pthread_rwlock_wrlock () from /usr/lib/libpthread.so.0
#1  0x000055555557ab32 in __netdata_rwlock_wrlock (rwlock=rwlock@entry=0x5555557a9630) at libnetdata/locks/locks.c:182
#2  0x000055555557b5c8 in netdata_rwlock_wrlock_debug (file=file@entry=0x55555563d7f7 "database/rrdhost.c", function=function@entry=0x55555563d920 <__FUNCTION__.24827> "rrdhost_cleanup_charts", line=line@entry=695,
   rwlock=rwlock@entry=0x5555557a9630) at libnetdata/locks/locks.c:279
#3  0x00005555555b7d16 in rrdhost_cleanup_charts (host=host@entry=0x5555557a9440) at database/rrdhost.c:695
#4  0x00005555555b7f82 in rrdhost_cleanup_all () at database/rrdhost.c:742
#5  0x0000555555562456 in netdata_cleanup_and_exit (ret=ret@entry=1) at daemon/main.c:31
#6  0x000055555557d135 in fatal_int (file=file@entry=0x55555563db8b "database/rrdset.c", function=function@entry=0x55555563f410 <__FUNCTION__.24640> "__rrdset_check_rdlock", line=line@entry=11,
   fmt=fmt@entry=0x55555563ddc8 "RRDSET '%s' should be read-locked, but it is not, at function %s() at line %lu of file '%s'") at libnetdata/log/log.c:828
#7  0x00005555555b8783 in __rrdset_check_rdlock (st=0x7fffebe8a000, file=file@entry=0x5555556293b2 "./database/rrd.h", function=function@entry=0x555555635e50 <__FUNCTION__.14823> "rrdset_first_entry_t", line=line@entry=866)
   at database/rrdset.c:11
#8  0x000055555559e24c in rrdset_first_entry_t (st=<optimized out>) at ./database/rrd.h:866
#9  rrdcalc_isrunnable (next_run=<synthetic pointer>, now=1555662290, rc=0x55555582f6f0) at health/health.c:344
#10 health_main (ptr=0x55555566ce68 <static_threads+616>) at health/health.c:529
#11 0x0000555555588545 in thread_start (ptr=<optimized out>) at libnetdata/threads/threads.c:126
#12 0x00007ffff77f8a9d in start_thread () from /usr/lib/libpthread.so.0
#13 0x00007ffff7728af3 in clone () from /usr/lib/libc.so.6
@vlvkobal
Contributor

(gdb) info threads
  Id   Target Id                                   Frame
  1    Thread 0x7ffff6a2c840 (LWP 11447) "netdata" 0x00007ffff7802a02 in pause () from /usr/lib/libpthread.so.0
  2    Thread 0x7ffff67fc700 (LWP 11468) "netdata" 0x00007ffff7728c4e in epoll_pwait () from /usr/lib/libc.so.6
  3    Thread 0x7ffff5ffb700 (LWP 11469) "netdata" 0x00007ffff77fd6de in pthread_rwlock_rdlock () from /usr/lib/libpthread.so.0
  4    Thread 0x7ffff57fa700 (LWP 11470) "netdata" 0x00007ffff7802850 in nanosleep () from /usr/lib/libpthread.so.0
  5    Thread 0x7ffff4ff9700 (LWP 11471) "netdata" 0x00007ffff77fdb35 in pthread_rwlock_wrlock () from /usr/lib/libpthread.so.0
  6    Thread 0x7ffff47f8700 (LWP 11472) "netdata" 0x00007ffff7719774 in read () from /usr/lib/libc.so.6
  7    Thread 0x7ffff3ff7700 (LWP 11474) "netdata" 0x00007ffff7802850 in nanosleep () from /usr/lib/libpthread.so.0
  8    Thread 0x7ffff37f6700 (LWP 11475) "netdata" 0x00007ffff7802850 in nanosleep () from /usr/lib/libpthread.so.0
  10   Thread 0x7ffff27f4700 (LWP 11477) "netdata" 0x00007ffff771dbf1 in poll () from /usr/lib/libc.so.6
  11   Thread 0x7ffff1ff3700 (LWP 11478) "netdata" 0x00007ffff77fd6de in pthread_rwlock_rdlock () from /usr/lib/libpthread.so.0
  12   Thread 0x7ffff2ff5700 (LWP 11479) "netdata" 0x00007ffff76f5688 in nanosleep () from /usr/lib/libc.so.6
  13   Thread 0x7ffff17f2700 (LWP 11480) "netdata" 0x00007ffff771dbf1 in poll () from /usr/lib/libc.so.6
  14   Thread 0x7ffff0ff1700 (LWP 11481) "netdata" 0x00007ffff77fdbef in pthread_rwlock_wrlock () from /usr/lib/libpthread.so.0
* 17   Thread 0x7fffef7ee700 (LWP 11490) "netdata" 0x00007ffff76f5688 in nanosleep () from /usr/lib/libc.so.6
  19   Thread 0x7fffee7ec700 (LWP 11497) "netdata" 0x00007ffff7719774 in read () from /usr/lib/libc.so.6
  22   Thread 0x7fffecfe9700 (LWP 11517) "netdata" 0x00007ffff7719774 in read () from /usr/lib/libc.so.6

@vlvkobal
Contributor

(gdb) where
#0  0x00007ffff76f5688 in nanosleep () from /usr/lib/libc.so.6
#1  0x00007ffff76f558e in sleep () from /usr/lib/libc.so.6
#2  0x00005555555aa0a9 in pluginsd_worker_thread (arg=0x5555559a5190) at collectors/plugins.d/plugins_d.c:560
#3  0x0000555555588545 in thread_start (ptr=<optimized out>) at libnetdata/threads/threads.c:126
#4  0x00007ffff77f8a9d in start_thread () from /usr/lib/libpthread.so.0
#5  0x00007ffff7728af3 in clone () from /usr/lib/libc.so.6

@vlvkobal
Contributor

Netdata was installed using this command:

sudo CFLAGS="-O1 -ggdb -Wall -Wextra -Wformat-signedness -fstack-protector-all -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1" ./netdata-installer.sh --enable-plugin-nfacct --enable-plugin-freeipmi --disable-go --disable-lto --dont-wait

@mfundul
Contributor Author

mfundul commented Apr 19, 2019

It's no longer easily reproducible; you need to check out commit b50b839 or older.

netdatabot added the bug and needs triage labels on Apr 20, 2019
@paulkatsoulakis
Contributor

We need to check the stack trace of every thread that is blocked in pthread_rwlock_wrlock.
In gdb, run thread THREAD_NUM for each of those thread numbers, and then where to see the trace from each.

cakrit added the priority/low label and removed the needs triage label on Apr 30, 2019
mfundul closed this as completed on May 7, 2019
@vlvkobal
Contributor

vlvkobal commented May 8, 2019

I'll try to reproduce this error with the old commit.

vlvkobal reopened this on May 8, 2019
vlvkobal assigned vlvkobal and unassigned mfundul on May 8, 2019
@cakrit
Contributor

cakrit commented Jul 25, 2019

Killing this issue. I asked another person to try to reproduce it as well, and he couldn't. If we ever see it again, we'll pick it up in a new issue.

@mfundul
Contributor Author

mfundul commented Nov 5, 2019

The problem here is that fatal() will attempt to take some locks that may already be held, and thus run into a deadlock.

Let's keep #7253 in mind when trying to reproduce this.

mfundul reopened this on Nov 5, 2019
mfundul assigned mfundul and unassigned vlvkobal on Nov 5, 2019
mfundul removed their assignment on Feb 26, 2021
stelfrag self-assigned this on Apr 2, 2021
@stelfrag
Collaborator

stelfrag commented Jul 9, 2021

Not reproduced

stelfrag closed this as completed on Jul 9, 2021