system.active_processes out of control since update #9084

Closed · kmlucy opened this issue May 18, 2020 · 11 comments · Fixed by #9107

Labels: bug, needs triage (Issues which need to be manually labelled)

Comments

@kmlucy (Contributor) commented May 18, 2020

Bug Report Summary

I just updated to v1.22.1-51-gb3cfa54b from v1.21 and now the system.active_processes metric is rising uncontrollably.

In the image below, A marks where I upgraded and B marks where I restarted Netdata.

[image: system.active_processes chart; A = upgrade, B = Netdata restart]

Running ps -aux | grep netdata on the host results in:

ps -aux | grep netdata
70         384  2.2  0.0 194072 13700 ?        Ss   12:45   1:34 postgres: netdata postgres 172.24.0.5(58430) idle
kyle     21170  0.0  0.0   3216  2136 pts/0    S+   13:55   0:00 grep --color=auto netdata
root     31927  0.0  0.0 2245736 42044 ?       Ssl  12:44   0:00 /usr/bin/docker start -a netdata
201      32005  8.0  0.6 531560 409784 ?       SNsl 12:44   5:41 /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set web web files group root -W set web web files owner root
201      32107  0.0  0.0  15804   804 ?        SNl  12:44   0:01 /usr/sbin/netdata --special-spawn-server
root     32513 25.0  0.0  56472 53840 ?        RN   12:45  17:35 /usr/libexec/netdata/plugins.d/apps.plugin 1
201      32514  2.7  0.0  41236 28892 ?        SNl  12:45   1:55 /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
201      32520  0.0  0.0 126464 11196 ?        SNl  12:45   0:03 /usr/libexec/netdata/plugins.d/go.d.plugin 1
201      32621  0.2  0.0   2880  2396 ?        SN   12:45   0:08 bash /usr/libexec/netdata/plugins.d/charts.d.plugin 1

Running the same in the container results in:

    1 netdata   5:43 /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set web web files group root -W set web web files owner root
    7 netdata   0:01 /usr/sbin/netdata --special-spawn-server
  222 root     18:01 /usr/libexec/netdata/plugins.d/apps.plugin 1
  223 netdata   1:56 /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
  229 netdata   0:03 /usr/libexec/netdata/plugins.d/go.d.plugin 1
  303 netdata   0:09 bash /usr/libexec/netdata/plugins.d/charts.d.plugin 1
  306 netdata   0:00 [timeout]
  313 netdata   0:00 [timeout]
  320 netdata   0:00 [timeout]
  327 netdata   0:00 [timeout]
...
32703 netdata   0:00 [timeout]
32710 netdata   0:00 [timeout]
32717 netdata   0:00 [timeout]
32724 netdata   0:00 [timeout]
32731 netdata   0:00 [timeout]
32738 netdata   0:00 [timeout]
32745 netdata   0:00 [timeout]
32752 netdata   0:00 [timeout]
32759 netdata   0:00 [timeout]
32766 netdata   0:00 [timeout]

There are (currently) 6562 lines in total, which roughly matches how far the active process count has climbed above what I would expect.
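For reference, that count can be reproduced directly with something like the following (the container is named netdata here, matching the docker start command later in the thread):

docker exec netdata ps -eo args | grep -c '^\[timeout\]'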

OS / Environment

I am running inside the official netdata/netdata Docker image. The information below is from the host server.

/etc/os-release:PRETTY_NAME="Debian GNU/Linux 10 (buster)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION_ID="10"
/etc/os-release:VERSION="10 (buster)"
/etc/os-release:VERSION_CODENAME=buster
/etc/os-release:ID=debian
/etc/os-release:HOME_URL="https://www.debian.org/"
/etc/os-release:SUPPORT_URL="https://www.debian.org/support"
/etc/os-release:BUG_REPORT_URL="https://bugs.debian.org/"
Netdata Version

v1.22.1-51-gb3cfa54b

Component Name

proc.plugin

Steps To Reproduce
  1. Start Netdata in Docker container
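For context, starting the container with the official image looked roughly like this at the time (a sketch only; the exact mounts and flags depend on the setup):

docker run -d --name=netdata \
  -p 19999:19999 \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  netdata/netdata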
Expected Behavior

Processes should not be created indefinitely.

kmlucy added the bug and needs triage labels on May 18, 2020
@ilyam8 (Member) commented May 18, 2020

Looks like the same issue as #9070.

@kmlucy (Contributor, Author) commented May 18, 2020

It may be. I saw that issue, but since it only mentioned occasional zombie processes (a maximum of two), I thought it might be unrelated.

For now, I'm having to restart my Netdata container every few hours to keep the processes under control.

@ilyam8 (Member) commented May 18, 2020

> For now, I'm having to restart my Netdata container every few hours to keep the processes under control.

> In the image below, A marks where I upgraded

Have you considered downgrading?

@kmlucy (Contributor, Author) commented May 18, 2020

I could downgrade, but I'm inevitably going to have to run v1.22 again to troubleshoot. I just put a restart command in my crontab; that way I have a running container from which to pull logs, data, etc.
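For anyone doing the same, such a crontab entry could look roughly like this (hypothetical schedule; container named netdata):

# restart the Netdata container every 4 hours as a temporary workaround
0 */4 * * * /usr/bin/docker restart netdata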

@mfundul (Contributor) commented May 19, 2020

Can you share /var/log/netdata/error.log since agent start?

Edit: this would actually be on stderr, so you can get it by running docker start -a netdata.
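On a container that is already running, the same stream should also be available via docker logs, for example:

docker logs netdata 2>&1 | tail -n 200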

@mfundul (Contributor) commented May 19, 2020

@kmlucy can you run ps -eo "pid,ppid,user,args" inside the container?

@ilyam8 (Member) commented May 19, 2020

[ilyam@ilyam-pc ~]$ docker exec -it netdata ps -eo "pid,ppid,user,args"
PID   PPID  USER     COMMAND
    1     0 netdata  /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set w
    7     1 netdata  /usr/sbin/netdata --special-spawn-server
  129     1 netdata  /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.pl
  152     1 netdata  /usr/libexec/netdata/plugins.d/go.d.plugin 1
  277     1 root     /usr/libexec/netdata/plugins.d/apps.plugin 1
  344     1 netdata  [timeout]
  472     0 root     ps -eo pid,ppid,user,args

@ilyam8 (Member) commented May 19, 2020

@kmlucy could you disable charts.d.plugin and check whether the problem persists? If that doesn't help, I would try disabling health.
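For reference, both can be switched off in netdata.conf (a sketch, assuming the stock config layout and the edit-config helper shipped with the image):

# /etc/netdata/edit-config netdata.conf, then set:
[plugins]
    charts.d = no

[health]
    enabled = no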

@kmlucy (Contributor, Author) commented May 19, 2020

@mfundul

# ps -eo "pid,ppid,user,args" | head
PID   PPID  USER     COMMAND
    1     0 netdata  /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set web web files group root -W set web web files owner root
    7     1 netdata  /usr/sbin/netdata --special-spawn-server
  219     1 root     /usr/libexec/netdata/plugins.d/apps.plugin 1
  259     1 netdata  /usr/libexec/netdata/plugins.d/go.d.plugin 1
  270     1 netdata  bash /usr/libexec/netdata/plugins.d/charts.d.plugin 1
  294     1 netdata  /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
  304     1 netdata  [timeout]
  311     1 netdata  [timeout]
  318     1 netdata  [timeout]
# ps -eo "pid,ppid,user,args" | tail
32702     1 netdata  [timeout]
32709     1 netdata  [timeout]
32716     1 netdata  [timeout]
32723     1 netdata  [timeout]
32730     1 netdata  [timeout]
32737     1 netdata  [timeout]
32744     1 netdata  [timeout]
32751     1 netdata  [timeout]
32758     1 netdata  [timeout]
32765     1 netdata  [timeout]

Error logs (stderr): https://termbin.com/h8f8

@ilyam8

Disabling the NUT collector seems to have fixed the issue.

# ps -eo "pid,ppid,user,args"
PID   PPID  USER     COMMAND
    1     0 netdata  /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set web web files group root -W set web web files owner root
    7     1 netdata  /usr/sbin/netdata --special-spawn-server
  231     1 root     /usr/libexec/netdata/plugins.d/apps.plugin 1
  260     1 netdata  /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
  280     1 netdata  /usr/libexec/netdata/plugins.d/go.d.plugin 1
  450     1 netdata  [timeout]
 1129     0 root     /bin/bash -c export TERM=xterm-color;bash
 1134  1129 root     bash
 1422  1134 root     ps -eo pid,ppid,user,args
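
For the record, the nut module can be turned off in the charts.d configuration, roughly like this (assuming the stock edit-config helper):

# /etc/netdata/edit-config charts.d.conf, then set:
nut="no"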

@mfundul (Contributor) commented May 19, 2020

The takeaway here is that all those processes are under PID 1 and not under PID 7. I'll look into it.
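That can be checked by grouping the leaked entries by parent PID, for example:

docker exec netdata ps -eo ppid,args | awk '$2 == "[timeout]" {print $1}' | sort | uniq -c

which should show all of them parented to PID 1 (the daemon) rather than PID 7 (the spawn server).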

@ilyam8 (Member) commented May 19, 2020

> Disabling the NUT collector seems to have fixed the issue.

charts.d strikes again 😞
