
Segfault on NetData v1.14.0-51-g18336910 #6013

Closed
justinmayer opened this issue May 14, 2019 · 10 comments · Fixed by #6011

@justinmayer

Bug report summary

I have a master NetData instance, with multiple servers streaming to it. In all cases, NetData was installed via netdata-installer.sh on the stable channel.

I just installed NetData v1.14.0 on a to-be-monitored server. When I connected to the master NetData dashboard running v1.13 via the usual SSH-tunneled http://localhost:19999, the console notified me about the new version, so I upgraded the master NetData instance to version v1.14.0-51-g18336910. I also took the liberty of running sudo apt dist-upgrade on that host, which is running Ubuntu 18.04.2.

Upon restarting the master NetData instance, its dashboard would no longer load via the SSH tunnel, with the following printed in the SSH session:

channel 3: open failed: connect failed: Connection refused

Running nc -vz localhost 19999 directly on the NetData master yielded:

nc: connect to localhost port 19999 (tcp) failed: Connection refused

Running tail -f /var/log/netdata/error.log appears to show a steady stream of incoming reports from monitored servers.

Running sudo tail -f /var/log/syslog shows entries including:

May 14 09:01:48 netdata systemd[1]: Started Real time performance monitoring.
May 14 09:01:48 netdata kernel: [ 4340.337011] netdata[1036]: segfault at 0 ip 00007faadb0bb646 sp 00007faad318b648 error 4 in libc-2.27.so[7faadb00a000+1e7000]
May 14 09:01:49 netdata systemd[1]: netdata.service: Main process exited, code=killed, status=11/SEGV
May 14 09:01:49 netdata systemd[1]: netdata.service: Failed with result 'signal'.

I see that PR #6011 was just merged and mentions segfaults, but it's not clear from the PR description whether that PR is related to the issue I am reporting here.

If that merged PR is likely to fix this problem, may I humbly suggest that the fix be included in an imminent v1.14.1 release?

As an aside, is there a way to downgrade? In other words, can I invoke the netdata-installer.sh script and somehow specify v1.13.0 so that version is re-installed?

OS / Environment

Ubuntu 18.04.2 on NetData master. Monitored servers are running either Ubuntu 16.04 or Ubuntu 18.04.

Netdata version (output of netdata -V)

v1.14.0-51-g18336910

@mfundul
Contributor

mfundul commented May 14, 2019

As far as I know, this bug was introduced in the GitHub master branch yesterday and fixed today, but it is not a problem for the NetData v1.14.0 release.

@justinmayer
Author

In that case, it seems the issue I reported is unrelated to that pull request, since I am running v1.14.0-51-g18336910, which I presume is the stable-channel release.

@vlvkobal
Contributor

No. v1.14.0-51-g18336910 is the latest version, with the latest PR, #6011, merged. Could you run your netdata with valgrind and provide a backtrace?

@justinmayer
Author

@vlvkobal: I'd be happy to. Could you tell me how exactly to do that, or point me to the instructions that would indicate precisely how that's done?

@vlvkobal
Contributor

sudo systemctl stop netdata
sudo valgrind netdata -D

@justinmayer
Author

$ sudo valgrind netdata -D
==31756== Memcheck, a memory error detector
==31756== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==31756== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==31756== Command: netdata -D
==31756== 
==31756== Thread 13:
==31756== Invalid read of size 1
==31756==    at 0x4C32CF2: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31756==    by 0x4ED99AD: strdup (strdup.c:41)
==31756==    by 0x1A7E18: strdupz (in /usr/sbin/netdata)
==31756==    by 0x15DBF4: rrdhost_create (in /usr/sbin/netdata)
==31756==    by 0x1657E4: rrdpush_receive (in /usr/sbin/netdata)
==31756==    by 0x165BA6: rrdpush_receiver_thread.lto_priv.93 (in /usr/sbin/netdata)
==31756==    by 0x14FB2E: thread_start (in /usr/sbin/netdata)
==31756==    by 0x59F66DA: start_thread (pthread_create.c:463)
==31756==    by 0x4F5D88E: clone (clone.S:95)
==31756==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==31756== 
==31756== 
==31756== Process terminating with default action of signal 11 (SIGSEGV)
==31756==  Access not within mapped region at address 0x0
==31756==    at 0x4C32CF2: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31756==    by 0x4ED99AD: strdup (strdup.c:41)
==31756==    by 0x1A7E18: strdupz (in /usr/sbin/netdata)
==31756==    by 0x15DBF4: rrdhost_create (in /usr/sbin/netdata)
==31756==    by 0x1657E4: rrdpush_receive (in /usr/sbin/netdata)
==31756==    by 0x165BA6: rrdpush_receiver_thread.lto_priv.93 (in /usr/sbin/netdata)
==31756==    by 0x14FB2E: thread_start (in /usr/sbin/netdata)
==31756==    by 0x59F66DA: start_thread (pthread_create.c:463)
==31756==    by 0x4F5D88E: clone (clone.S:95)
==31756==  If you believe this happened as a result of a stack
==31756==  overflow in your program's main thread (unlikely but
==31756==  possible), you can try to increase the size of the
==31756==  main thread stack using the --main-stacksize= flag.
==31756==  The main thread stack size used in this run was 8388608.
==31756== 
==31756== HEAP SUMMARY:
==31756==     in use at exit: 2,133,932 bytes in 38,768 blocks
==31756==   total heap usage: 68,134 allocs, 29,366 frees, 5,607,281 bytes allocated
==31756== 
==31756== LEAK SUMMARY:
==31756==    definitely lost: 227 bytes in 2 blocks
==31756==    indirectly lost: 0 bytes in 0 blocks
==31756==      possibly lost: 3,840 bytes in 12 blocks
==31756==    still reachable: 2,129,865 bytes in 38,754 blocks
==31756==         suppressed: 0 bytes in 0 blocks
==31756== Rerun with --leak-check=full to see details of leaked memory
==31756== 
==31756== For counts of detected and suppressed errors, rerun with: -v
==31756== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==31756== could not unlink /tmp/vgdb-pipe-from-vgdb-to-31756-by-root-on-???
==31756== could not unlink /tmp/vgdb-pipe-to-vgdb-from-31756-by-root-on-???
==31756== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-31756-by-root-on-???
Segmentation fault

@mfundul
Contributor

mfundul commented May 14, 2019

@vlvkobal by looking at git, it seems that g18336910 does not contain the fix:

commit 1833691018fda9eb6b80eed373c482a394a1267e
Author: Vladimir Kobal <vlad@prokk.net>
Date:   Tue May 14 13:48:20 2019 +0300

    Fix Coverity defects (#6008)

It should instead have been commit 0d5fa83e224afa6d530ed6a5e6e6fe35a8af955e:

commit 0d5fa83e224afa6d530ed6a5e6e6fe35a8af955e (upstream/master, upstream/HEAD, master)
Author: Vladimir Kobal <vlad@prokk.net>
Date:   Tue May 14 18:35:12 2019 +0300

    Fix segmentation fault (#6011)
    
    * Fix segmentation fault
    
    * Make system info printing safe
    
    * Fix quotes for OS name

@vlvkobal
Contributor

Yes, my bad. I rechecked the version; the latest one should be v1.14.0-52-g0d5fa83e. @justinmayer, please update your netdata to include the latest commit.

@netdatabot added the bug and needs triage labels on May 14, 2019
@agross

agross commented May 14, 2019

The latest commit fixed the segfault for me. Thanks!

@vlvkobal added the area/external/python, priority/high, and area/streaming labels and removed the needs triage and area/external/python labels on May 15, 2019
@justinmayer
Author

I updated NetData to include the latest commit, which did indeed fix the segfault. Thanks to everyone for such a rapid and responsive exchange here. Much appreciated.
