Prometheus stop timeout #5397

Open
wlay opened this Issue Mar 22, 2019 · 5 comments

wlay commented Mar 22, 2019

Proposal

Use case. Why is this important?
The Prometheus process can't exit gracefully.

“Nice to have” is not a good use case. :)

Bug Report

What did you do?
sudo systemctl stop prometheus.service
What did you expect to see?
The process exits gracefully and logs "See you next time!"
What did you see instead? Under which circumstances?
The process did not log "See you next time!" before the stop timed out.

Environment
CentOS

  • System information:

    Linux 3.10.0-514.el7.x86_64 x86_64

  • Prometheus version:

    2.7.1

  • Alertmanager version:

    Not deployed.

  • Prometheus configuration file:

insert configuration here
  • Alertmanager configuration file:
insert configuration here (if relevant to the issue)
  • Logs:
prometheus log:
level=warn ts=2019-03-22T08:35:46.127409912Z caller=main.go:464 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2019-03-22T08:35:46.127525054Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-03-22T08:35:46.127559332Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-03-22T08:35:46.127573097Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-03-22T08:35:46.127571547Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-03-22T08:35:46.127623261Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-03-22T08:35:50.821469229Z caller=manager.go:745 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-03-22T08:35:50.821504484Z caller=main.go:519 msg="Scrape manager stopped"
level=info ts=2019-03-22T08:35:50.821581696Z caller=manager.go:751 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-03-22T08:35:50.834491455Z caller=queue_manager.go:259 component=remote queue="0:https://xxxx/write?db=xxxxx&rp=autogen&precision=s" msg="Stopping remote storage..."
level=info ts=2019-03-22T08:35:50.834598573Z caller=queue_manager.go:267 component=remote queue="0:https://xxxx/write?db=xxxxx&rp=autogen&precision=s" msg="Remote storage stopped."
level=info ts=2019-03-22T08:35:50.858201061Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-03-22T08:35:50.858226827Z caller=main.go:679 msg="Notifier manager stopped"

/var/log/messages:
Mar 22 08:37:16 xxxx systemd: prometheus.service stop-final-sigterm timed out. Killing.
Mar 22 08:37:16 xxxx systemd: Unit prometheus.service entered failed state.
Mar 22 08:37:16 xxxx systemd: Triggering OnFailure= dependencies of prometheus.service.
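
For reference, the time between the SIGTERM and systemd's kill can be worked out from the timestamps in the two logs above. This is only a minimal sketch: the syslog lines carry no year or time zone, so it assumes 2019 and UTC (matching the Prometheus log), and the helper names are illustrative.

```python
from datetime import datetime, timezone

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def parse_prometheus_ts(ts):
    """Parse a Prometheus log timestamp such as 2019-03-22T08:35:46.127409912Z."""
    base, frac = ts.rstrip("Z").split(".")
    # datetime only stores microseconds, so the nanosecond digits are truncated.
    micros = int(frac[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def parse_syslog_ts(ts, year):
    """Parse a syslog timestamp such as 'Mar 22 08:37:16' (no year, no zone)."""
    month, day, clock = ts.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    # Assumption: the host clock is UTC, like the Prometheus log.
    return datetime(year, MONTHS.index(month) + 1, int(day),
                    hour, minute, second, tzinfo=timezone.utc)


# Timestamps copied from the log excerpts in this report.
sigterm_received = parse_prometheus_ts("2019-03-22T08:35:46.127409912Z")
last_prometheus_line = parse_prometheus_ts("2019-03-22T08:35:50.858226827Z")
systemd_kill = parse_syslog_ts("Mar 22 08:37:16", year=2019)  # year is assumed

logged_shutdown = (last_prometheus_line - sigterm_received).total_seconds()
wait_before_kill = (systemd_kill - sigterm_received).total_seconds()

print(f"Prometheus logged shutdown steps for {logged_shutdown:.2f} s")
print(f"systemd killed the unit {wait_before_kill:.2f} s after SIGTERM")
```

Under those assumptions, Prometheus logged shutdown progress for about 4.7 s and the unit was killed roughly 90 s after the SIGTERM. That would be consistent with systemd's default stop timeout when TimeoutStopSec is not set.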

Member

simonpasquier commented Mar 22, 2019

Did you configure a specific TimeoutStopSec value? Have you tried starting and stopping Prometheus without systemd to confirm that the process never stops?

Author

wlay commented Mar 25, 2019

Hi @simonpasquier
I didn't configure TimeoutStopSec.
The problem does not reproduce every time. Is there a good way to confirm it, from your point of view?

```
[Unit]
Description=prometheus
OnFailure=mail_on_failure@%n.service
After=network.target

[Service]
User=eng
Group=eng
Type=simple
LimitNOFILE=65536
WorkingDirectory=/opt/prometheus
ExecStart=/bin/sh -c "/opt/prometheus/prometheus --web.enable-lifecycle --config.file=/opt/prometheus/etc/prometheus.yml --storage.tsdb.path=/data --storage.tsdb.retention.time=1d --web.listen-address=0.0.0.0:9090 >>/var/log/prometheus/prometheus.log 2>&1"
Restart=on-failure
LimitCORE=infinity
Environment=GOTRACEBACK=crash

[Install]
WantedBy=multi-user.target
```

Member

simonpasquier commented Mar 25, 2019

> This problem is not repeated every time.

In that case, I would recommend setting a TimeoutStopSec value large enough to accommodate your setup. You might want to start and stop the process a few times manually (e.g. without systemd), measure how long it actually takes to exit, and multiply that number by a few.
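
One way to take that measurement is sketched below. It is a rough illustration, not a tested procedure for Prometheus: the function names, the warm-up delay, and the safety factor of 3 are all assumptions, and the demo uses a stand-in child process that would need to be replaced with the real Prometheus command line.

```python
import math
import signal
import subprocess
import sys
import time


def measure_stop_seconds(command, warmup_seconds=1.0, give_up_after=300.0):
    """Start `command`, send SIGTERM, and return the seconds until it exits."""
    proc = subprocess.Popen(command)
    try:
        # Let the process finish starting before asking it to stop.
        time.sleep(warmup_seconds)
        started = time.monotonic()
        proc.send_signal(signal.SIGTERM)
        # Raises subprocess.TimeoutExpired if the process never exits.
        proc.wait(timeout=give_up_after)
        return time.monotonic() - started
    finally:
        # Never leave a stray child behind if something went wrong.
        if proc.poll() is None:
            proc.kill()
            proc.wait()


def suggest_timeout_stop_sec(samples, safety_factor=3):
    """Multiply the slowest observed shutdown by a safety factor (assumed: 3)."""
    return math.ceil(max(samples) * safety_factor)


if __name__ == "__main__":
    # Stand-in child so the sketch runs anywhere; swap in the real
    # Prometheus command line from the unit file to measure it.
    command = [sys.executable, "-c", "import time; time.sleep(60)"]
    samples = [measure_stop_seconds(command, warmup_seconds=0.5) for _ in range(3)]
    print("shutdown times (s):", [round(s, 2) for s in samples])
    print("suggested TimeoutStopSec:", suggest_timeout_stop_sec(samples))
```

The suggested value could then be set as `TimeoutStopSec=` in the `[Service]` section of the unit. Since the problem is intermittent, a handful of runs may not catch the slow case, so more samples give a safer estimate.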

Member

simonpasquier commented Mar 29, 2019

@wlay I received a comment from you in my inbox which mentioned that the issue might be with the file SD not stopping properly. Is this correct?
Also, which filesystem stores the configuration files?

Author

wlay commented Apr 2, 2019

@simonpasquier I was mistaken about the file SD not stopping. Please ignore that comment.
I will keep investigating this problem. If I find any clue, I will post an update here.
Sorry for the confusion.
