
Prometheus hangs and generates too many ps processes #4938

Closed
Hacky-DH opened this Issue Dec 1, 2018 · 4 comments


Hacky-DH commented Dec 1, 2018

The Prometheus main process hangs

cat /proc/170606/stat   
170606 (prometheus) ...
cat /proc/170606/stack  
[<ffffffff81333fa8>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff816b3a0c>] __do_page_fault+0x37c/0x450
[<ffffffff816b3b15>] do_page_fault+0x35/0x90
[<ffffffff816af8f8>] page_fault+0x28/0x30
[<ffffffff810362d3>] save_xstate_sig+0x123/0x1c0
[<ffffffff8102a869>] do_signal+0x469/0x6c0
[<ffffffff8102ab1f>] do_notify_resume+0x5f/0xb0
[<ffffffff816b8d37>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
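
For reference, a rough way to confirm the task state and the kernel function it is blocked in (same PID 170606 as above; exact output differs per system, so treat this as a sketch):

grep '^State' /proc/170606/status       # D means uninterruptible sleep
cat /proc/170606/wchan; echo            # kernel symbol the task is sleeping in
ps -o pid,stat,wchan:32,comm -p 170606  # same via ps; comm avoids reading /proc/<pid>/cmdline, which can itself block for a hung task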

kill -9 does not work!
It also generates many ps processes: my host has 5912 ps processes, their state is D, and they cannot be killed with kill -9 either.

admin    168889 168887 D    ps -ef
admin    168896 168895 D    ps axfww
admin    168929 168928 D    ps axfww
admin    168966 168932 D    ps -ef
admin    169212 169211 D    ps -ef
admin    169222 169221 D    ps axfww
adsop    169351 169350 D    ps -aux
admin    169362 169361 D    ps axfww
admin    169369 169368 D    ps -ef

This causes high load.
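
A rough sketch for getting an overview of the stuck processes (run as root so all /proc entries are readable; ps options may vary slightly by version):

# Count tasks in uninterruptible sleep and show where they are blocked.
# comm is used instead of cmd/args so ps does not read /proc/<pid>/cmdline,
# which can itself block for a hung task.
ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'
ps -eo stat | grep -c '^D'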

How can this be fixed? And is there a bug in Prometheus?


hoffie commented Dec 1, 2018

Is this system still hanging? Can you please provide some more context, such as distribution, kernel version, Prometheus version, and dmesg output? Can you reproduce this issue? If you have kernel crashdumps configured, it may help to generate one by deliberately crashing the system so that someone (e.g. your OS vendor) can analyze it (for this step, maybe wait until one of the maintainers has commented; keep in mind that crashdumps contain arbitrary system RAM and therefore may contain passwords or other secrets).

Regarding the generation of ps processes: I don't think prometheus ever starts such processes, so I would rather suspect that a human or some other process is executing them.

I recently had a similar case where all tools which accessed a specific pid directory in /proc got into a hung state (including process_exporter, ps, top, cat /proc/SOMEPID/cmdline). In my experience, the high load in such cases is caused by the kernel task queue which will only grow due to the blocking. I would also expect to see hung task messages in dmesg.
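
Roughly what I would check first (a sketch; paths and messages can differ per kernel/distro):

dmesg | grep -i 'blocked for more than'        # hung task detector warnings, if enabled
cat /proc/sys/kernel/hung_task_timeout_secs    # 0 means the hung task detector is disabled
cat /proc/loadavg                              # on Linux, D-state tasks count towards load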

I don't know what the maintainers think, but I would assume this is not an issue that can be caused by userspace (i.e. Prometheus) alone, and you may have to get help from your OS vendor instead.

Hacky-DH commented Dec 3, 2018

Thanks for your reply.

OS: CentOS 7.4
Kernel version: 3.10.0-693.17.1.el7.x86_64
Prometheus version: 2.2.1

The system halted and has not booted so far. dmesg doesn't contain any information from Prometheus, and there are no crash messages.

We suspect that Prometheus generates the many ps processes because all of the ps processes have fd 6 open as
6 -> /proc/170606/cmdline
where 170606 is Prometheus's PID.
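
As a cross-check (a sketch; run as root so every /proc/<pid>/fd directory is readable), something like this lists the PIDs holding /proc/170606/cmdline open without reading the file itself:

# readlink only resolves the fd symlink, so it does not block the way
# cat or ps do when they actually read the hung task's cmdline.
for fd in /proc/[0-9]*/fd/*; do
  if [ "$(readlink "$fd" 2>/dev/null)" = "/proc/170606/cmdline" ]; then
    pid=${fd#/proc/}; echo "${pid%%/*}"
  fi
done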

The other piece of information is that Prometheus uses a storage volume in k8s. Prometheus probably got stuck when the volume became unavailable.

Do you have any ideas?

gouthamve commented Dec 3, 2018

Huh, I would wonder if kubernetes could be spawning those. /proc/<pid>/cmdline contains the process's command line, which AFAICS Prometheus never accesses. I'd tend to agree with @hoffie that this is unlikely to be a problem caused by Prometheus itself.

Hacky-DH commented Dec 8, 2018

The root cause is that the storage volume in k8s became unavailable; all operations that access the volume hang.
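
For anyone hitting the same thing, a rough way to confirm a hung mount without blindly touching it (the path /prometheus-data below is only a placeholder for the actual volume mount point):

# Reads /proc/self/mounts only; does not touch the filesystem itself.
grep /prometheus-data /proc/mounts

# Probe the mount with a timeout; exit code 124 (timed out) suggests the
# filesystem is hanging rather than just slow. Note the probe itself can
# end up stuck in D state if the mount is completely wedged.
timeout 5 stat -f /prometheus-data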

Hacky-DH closed this Jan 8, 2019
