Netdata is not killing nvidia-smi process on exit #7143

Closed
vobruba-martin opened this issue Oct 21, 2019 · 14 comments · Fixed by #7372
Labels: area/collectors, bug, collectors/python.d

Comments

@vobruba-martin
Contributor

Bug report summary

If I shut down netdata, the nvidia-smi process is not killed. The number of running nvidia-smi processes keeps growing with every netdata restart.

OS / Environment

Ubuntu 12.04

Netdata version (output of netdata -V)

netdata v1.18.0-44-nightly

Component Name

python.d/nvidia_smi

Steps To Reproduce

Enable the nvidia-smi module. Start and stop netdata.

Expected behavior

nvidia-smi process should not be running.

vobruba-martin added the bug and needs triage labels Oct 21, 2019
ilyam8 added the area/external/python label and removed the needs triage label Oct 21, 2019
ilyam8 self-assigned this Oct 21, 2019
ilyam8 added this to the v1.19-Sprint4 milestone Nov 21, 2019
@ilyam8
Member

ilyam8 commented Nov 26, 2019

I can't reproduce it. I can't check it on a real PC because Ubuntu 12 is too old.

What I did:

  • installed Ubuntu 12.04.5 on a VM
  • installed netdata 1.18.1
  • made a fake nvidia-smi binary that produces exactly the same output as the real one

I see the nvidia-smi charts, and I see that it is running:

root@ubuntu12:/opt/netdata/usr/libexec/netdata/plugins.d# ps faxu | grep netdata
root     28549  0.0  0.0   9396   948 pts/0    S+   19:00   0:00                      \_ grep --color=auto netdata
netdata  28203  0.5  1.8  65396 18704 ?        SNl  18:54   0:02 /opt/netdata/bin/srv/netdata -P /opt/netdata/var/run/netdata/netdata.pid
netdata  28236  0.0  0.0   1620   964 ?        SN   18:54   0:00  \_ bash /opt/netdata/usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
root     28238  0.6  0.1   3812  1428 ?        SN   18:54   0:02  \_ /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin 1
netdata  28244  1.2  2.6 210504 26492 ?        SNl  18:54   0:04  \_ /usr/bin/python /opt/netdata/usr/libexec/netdata/plugins.d/python.d.plugin 1
netdata  28501  0.0  0.6 102756  6848 ?        SNl  18:54   0:00  |   \_ /home/test/fake-nvidia-smi -x -q -l 1
netdata  28245  0.0  1.3 122932 13924 ?        SNl  18:54   0:00  \_ /opt/netdata/usr/libexec/netdata/plugins.d/go.d.plugin 1

Then I restart the service:

root@ubuntu12:/home/test# service netdata restart
 * Stopping real-time performance monitoring netdata                                                                                        [ OK ] 
 * Starting real-time performance monitoring netdata                                                                                               2019-11-26 19:02:52: netdata INFO  : MAIN : SIGNAL: Not enabling reaper
                                                                                                                                            [ OK ]

And check again

root@ubuntu12:/home/test# ps faxu | grep netdata
root     29181  0.0  0.0   9396   948 pts/1    S+   19:03   0:00                          \_ grep --color=auto netdata
netdata  28863  1.0  1.7  63764 18084 ?        SNl  19:02   0:00 /opt/netdata/bin/srv/netdata -P /opt/netdata/var/run/netdata/netdata.pid
netdata  28897  0.0  0.0   1620   944 ?        SN   19:02   0:00  \_ bash /opt/netdata/usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
root     28898  0.5  0.1   3924  1684 ?        SN   19:02   0:00  \_ /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin 1
netdata  28904  2.3  2.4 210504 24656 ?        SNl  19:02   0:00  \_ /usr/bin/python /opt/netdata/usr/libexec/netdata/plugins.d/python.d.plugin 1
netdata  29161  0.0  0.4 102756  4832 ?        SNl  19:02   0:00  |   \_ /home/test/fake-nvidia-smi -x -q -l 1
netdata  28905  0.0  1.3 122932 13920 ?        SNl  19:02   0:00  \_ /opt/netdata/usr/libexec/netdata/plugins.d/go.d.plugin 1

I see no zombie process; fake-nvidia-smi is killed.

It is killed by SIGPIPE:

write(1, "<?xml version=\"1.0\" ?>\n<!DOCTYPE"..., 27157) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigreturn(0xc000000180)              = -1 EPIPE (Broken pipe)
futex(0x565ad0, FUTEX_WAKE_PRIVATE, 1)  = 1
rt_sigprocmask(SIG_UNBLOCK, [PIPE], NULL, 8) = 0
getpid()                                = 28501
gettid()                                = 28501
tgkill(28501, 28501, SIGPIPE)           = 0
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigreturn(0xd)                       = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
rt_sigaction(SIGPIPE, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x4555e0}, NULL, 8) = 0
getpid()                                = 28501
gettid()                                = 28501
tgkill(28501, 28501, SIGPIPE)           = 0
--- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 28501 detached
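
The trace above shows the child receiving SIGPIPE once nothing is reading its stdout pipe any more. A minimal sketch of that mechanism follows, assuming a Linux host. The child here is a hypothetical stand-in for fake-nvidia-smi, not the real binary, and it restores the default SIGPIPE action because the Python interpreter ignores SIGPIPE at startup.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for `fake-nvidia-smi -x -q -l 1`: prints forever.
CHILD_CODE = r"""
import signal, sys, time
# Python ignores SIGPIPE by default; restore the default (terminate) action,
# which is what a typical C program starts with.
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
while True:
    sys.stdout.write("<nvidia_smi_log/>\n")
    sys.stdout.flush()
    time.sleep(0.05)
"""


def run_and_close_pipe():
    """Start the child, read one line, then drop the read end of the pipe."""
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                             stdout=subprocess.PIPE)
    child.stdout.readline()   # make sure the child is up and writing
    child.stdout.close()      # the reader goes away, like the plugin exiting
    return child.wait(timeout=10)


returncode = run_and_close_pipe()
# A negative return code means "terminated by that signal number".
print(returncode == -signal.SIGPIPE)
```

This only covers the case where the writer keeps the default SIGPIPE action; a program that ignores or handles the signal would keep running.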

@vobruba-martin you are saying that the nvidia-smi process is not killed after service netdata restart, right?

@vobruba-martin
Contributor Author

The nvidia-smi process is not killed after service netdata stop. On my machine it takes a while before netdata shuts down completely.

Before shut down:

# pstree -ps 12558
init(1)───netdata(12127)───python(12181)───nvidia-smi(12558)

Immediately after calling service netdata stop:

# pstree -ps 12558
init(1)───netdata(12127)───python(12181)───nvidia-smi(12558)

After some time when there is no netdata process:

# pstree -ps 12558
init(1)───nvidia-smi(12558)
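
The last pstree output shows the leftover process reparented to init (PID 1). A small sketch for checking that programmatically, assuming Linux with /proc mounted; the helper names are hypothetical, and on systems that use a subreaper an orphan may be adopted by a process other than PID 1.

```python
import os


def parent_pid(pid):
    """Return the parent PID of `pid`, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("PPid:"):
                return int(line.split()[1])
    raise ValueError(f"no PPid field found for pid {pid}")


def is_orphaned(pid):
    """True if `pid` has been reparented to init (PID 1)."""
    return parent_pid(pid) == 1


# Sanity check against the current process.
own_parent = parent_pid(os.getpid())
print(own_parent == os.getppid())
```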

@ilyam8
Member

ilyam8 commented Nov 26, 2019

Does it exit if you do?

kill -SIGPIPE $(pidof nvidia-smi)

@mfundul
Contributor

mfundul commented Nov 26, 2019

Can you kill those nvidia-smi processes that are left over? Are they zombie processes? Are they unkillable?

What is your hardware exactly? Is it possible that this is a hardware issue?

Maybe upgrading your GPU driver would solve this issue. There are reports on the internet of unkillable nvidia-smi processes from various tools.

@ilyam8
Member

ilyam8 commented Nov 26, 2019

@vobruba-martin please check #7372, the problem should be fixed (or not)

  • git clone https://github.com/ilyam8/netdata --branch nvidia_smi_not_loop_mode netdata_nvidia_smi

  • cd netdata_nvidia_smi

And install it.

@ilyam8
Member

ilyam8 commented Nov 26, 2019

@vobruba-martin

try to execute

nvidia-smi -x -q

if it hangs, then there is no reason to try #7372

@vobruba-martin
Contributor Author

@ilyam8

Does it exit if you do?

kill -SIGPIPE $(pidof nvidia-smi)

No!

try to execute

nvidia-smi -x -q

if it hangs, then there is no reason to try #7372

It doesn't hang.

please check #7372, the problem should be fixed (or not)

I'll try it later if the info above doesn't help.

@mfundul

Can you kill those nvidia-smi processes that are left over? Are they zombie processes? Are they unkillable?

Yes, I can kill them with kill $pid. They don't seem to be zombie processes:

# ps aux | grep nvidia-smi
netdata  39186 16.3  0.0  17584  4760 ?        S    Nov26 160:03 /usr/bin/nvidia-smi -x -q -l 1
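
The thread does not establish why the real nvidia-smi survives SIGPIPE. The sketch below only illustrates that a process which ignores SIGPIPE behaves the way reported here: it stays alive after kill -SIGPIPE and exits on a plain kill (SIGTERM). The child is a hypothetical stand-in, and whether nvidia-smi actually ignores the signal is an assumption.

```python
import signal
import subprocess
import sys

# Hypothetical child that ignores SIGPIPE and then idles.
CHILD_CODE = r"""
import signal, time
signal.signal(signal.SIGPIPE, signal.SIG_IGN)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""

child = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                         stdout=subprocess.PIPE)
child.stdout.readline()            # wait until SIGPIPE is being ignored

child.send_signal(signal.SIGPIPE)  # equivalent of `kill -SIGPIPE <pid>`
try:
    child.wait(timeout=1)
    survived_sigpipe = False
except subprocess.TimeoutExpired:
    survived_sigpipe = True        # still running: the signal was ignored

child.terminate()                  # SIGTERM, same as a plain `kill <pid>`
exit_code = child.wait(timeout=10)
child.stdout.close()

print(survived_sigpipe, exit_code == -signal.SIGTERM)
```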

What is your hardware exactly? Is it possible that this is a hardware issue?

HP ProLiant DL380p Gen8 Server with NVIDIA Tesla T4 GPU.

@ilyam8
Member

ilyam8 commented Nov 27, 2019

@vobruba-martin

Just modify your nvidia_smi.chart.py and check it; that will be faster than the git clone way. It will fix the problem 100%.

--- collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py	(date 1574785782000)
+++ collectors/python.d.plugin/nvidia_smi/nvidia_smi.chart.py	(date 1574846528946)
@@ -346,14 +346,7 @@
         self.poller = NvidiaSMIPoller(poll)
 
     def get_data(self):
-        if not self.poller.is_started():
-            self.poller.start()
-
-        if not self.poller.is_alive():
-            self.debug('poller is off')
-            return None
-
-        last_data = self.poller.data()
+        last_data = self.poller.run_once()
         if not last_data:
             return None
 

@vobruba-martin
Contributor Author

@ilyam8 Yes, the problem is fixed with this change.

@ilyam8
Member

ilyam8 commented Nov 27, 2019

@vobruba-martin I made it configurable. After #7372 is merged, you need to turn off loop mode in the config file:

loop_mode: no

@vobruba-martin
Contributor Author

I've found one disadvantage of loop_mode: no. Since the change, my syslog has been spammed with these messages every second:

Nov 27 11:18:44 test kernel: [561696.614892] nvidia 0000:0b:00.0: irq 170 for MSI/MSI-X
Nov 27 11:18:44 test kernel: [561696.614904] nvidia 0000:0b:00.0: irq 171 for MSI/MSI-X
Nov 27 11:18:44 test kernel: [561696.614912] nvidia 0000:0b:00.0: irq 172 for MSI/MSI-X
Nov 27 11:18:44 test kernel: [561696.614919] nvidia 0000:0b:00.0: irq 173 for MSI/MSI-X
Nov 27 11:18:44 test kernel: [561696.614927] nvidia 0000:0b:00.0: irq 174 for MSI/MSI-X
Nov 27 11:18:44 test kernel: [561696.614934] nvidia 0000:0b:00.0: irq 175 for MSI/MSI-X

@ilyam8
Member

ilyam8 commented Nov 27, 2019

You can filter these messages using syslog filters.

Putting this line in syslog.conf before the redirecting rules should help. Don't forget to restart rsyslog.

:msg, contains, "for MSI/MSI-X" stop

@ilyam8
Member

ilyam8 commented Nov 27, 2019

See

https://www.rsyslog.com/doc/v5-stable/configuration/filters.html

OK, the action is not stop. On this version it is ~:

:msg, contains, "for MSI/MSI-X" ~

The rsyslog version on Ubuntu 12 is ancient.
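
For illustration only, the effect of the contains filter can be mimicked in a few lines: any message containing the substring is discarded and the rest pass through. This is a sketch of the matching logic, not of rsyslog itself; the function name and the second sample line are hypothetical.

```python
def drop_matching(lines, needle="for MSI/MSI-X"):
    """Keep only the lines that do not contain `needle` (case-sensitive)."""
    return [line for line in lines if needle not in line]


sample = [
    "kernel: [561696.614892] nvidia 0000:0b:00.0: irq 170 for MSI/MSI-X",
    "kernel: [561700.000000] some unrelated message",  # hypothetical line
]
kept = drop_matching(sample)
print(kept)
```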

@vobruba-martin
Contributor Author

Thanks!

ilyam8 added the collectors/python.d and area/collectors labels and removed the area/external/python label Apr 14, 2022

3 participants