Prometheus poller not handling remote storage failures elegantly #4159

Closed · sevagh opened this issue May 11, 2018 · 7 comments

sevagh commented May 11, 2018

Bug Report

What did you do?

I tried to insert Prometheus metrics into Elasticsearch using a remote storage adapter:

remote_write:
  - url: "http://0.0.0.0:9201/write"
remote_read:
  - url: "http://0.0.0.0:9201/read"

Unfortunately the adapter is buggy, and when it crashes, Prometheus outputs errors like this:

May 11 09:36:26 bloop prometheus[4310]: level=warn ts=2018-05-11T16:36:26.434681726Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=100 err="context deadline exceeded

At this point, I try to disable the remote storage settings and send a HUP to Prometheus to reload the config without the remote_* sections. However, it stays stuck in a bad state until it gets a SIGKILL.

systemctl stop prometheus2 and systemctl restart prometheus2 don't help at this point either; nothing works except a SIGKILL.
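
Roughly, the sequence looks like this. A sketch, assuming the binary is named prometheus (so pidof finds it) and the prometheus2 unit name from the packaging used here:

# comment out remote_write / remote_read in the config, then ask for a reload
kill -HUP "$(pidof prometheus)"

# neither of these brings the process back to a good state
systemctl stop prometheus2
systemctl restart prometheus2

# the only thing that works
kill -9 "$(pidof prometheus)"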

What did you expect to see?

I expected Prometheus to disregard the broken remote storage adapter and continue being responsive to HUPs, TERMs, etc.

What did you see instead? Under which circumstances?

Environment

  • System information:

Linux 4.9.0-4-amd64 x86_64

  • Prometheus version:
prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d82a045d103ea7f3c89a91fba4a93e6367a)
  build user:       root@6e784304d3ff
  build date:       20180119-12:01:23
  go version:       go1.9.2
  • Prometheus configuration file:
rule_files:
  - /etc/prometheus2/rules.local/*.yml

remote_write:
 - url: "http://0.0.0.0:9201/write"
remote_read:
 - url: "http://0.0.0.0:9201/read"

So obviously I need to ensure that my adapter is working correctly. But every time I test it, I have to SIGKILL Prometheus. Is that expected, desirable, normal?
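
For what it's worth, newer Prometheus releases let you bound how much the remote-write path buffers and retries via queue_config under remote_write. A minimal sketch, assuming a release that supports these fields; the values are illustrative rather than recommendations, and field names vary between versions, so check the remote_write documentation for your release:

remote_write:
  - url: "http://0.0.0.0:9201/write"
    queue_config:
      capacity: 2500             # samples buffered per shard
      max_shards: 10             # cap on parallel senders
      max_samples_per_send: 500  # batch size per request
      batch_send_deadline: 5s    # flush a partial batch after this long
      min_backoff: 30ms          # retry backoff bounds for failed sends
      max_backoff: 5s

This doesn't address the shutdown hang itself, but it limits how far the queues can grow while an adapter is down.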

sevagh changed the title from "Prometheus poller not handling remote_write/remote_read adapter outages" to "Prometheus poller not handling remote storage failures elegantly" May 11, 2018

sevagh commented May 11, 2018

Currently investigating whether it's related to #3941 ([BUGFIX] Correctly stop timer in remote-write path) by upgrading to Prometheus 2.2.1.

sevagh commented May 11, 2018

This may also be relevant: #2972

sevagh commented May 15, 2018

I tested 2.2.1 and saw the same behavior: if the adapter goes bad, Prometheus goes downhill with it. The number of remote-write shards increases and the queued samples climb high.
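
For reference, the queue growth can be watched through Prometheus's own remote-storage metrics over the HTTP API. A sketch, assuming the default listen address of localhost:9090; the metric names below are from the 2.x era and have been renamed in later releases:

# number of shards currently used for parallel sends
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_remote_storage_shards'

# samples waiting in the shard queues
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_remote_storage_pending_samples'

# rate of samples that failed to reach the remote endpoint
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_remote_storage_failed_samples_total[5m])'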

juliusv commented May 28, 2018

vishksaj commented May 29, 2018

+1
Same issue happens with my prom instance.
Error sending samples to remote storage" count=100 err="context deadline exceeded
Prometheus version : 2.1.0
BuildUser : root@6e784304d3ff
BuildDate : 20180119-12:01:23
GoVersion : go1.9.2
Writing data to InfluxDB via an adapter.

tomwilkie commented May 29, 2018

Will be fixed by #4187

tomwilkie closed this May 29, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
