
Prom2: SIGHUP for config reload causes artifacts in TSDB storage #2756

Closed
mwitkow opened this Issue May 22, 2017 · 20 comments


mwitkow commented May 22, 2017

What did you do?
Ran P2 from the dev-2.0 branch side by side with our P1 stack, with the same config reloading and the same scraped targets.

What did you expect to see?
Stable retention, similar to P1.

What did you see instead? Under which circumstances?

A periodic (roughly every 1h) hole in our data, which shows up massively in rates.

[screenshot]

This coincides with our config syncing job, which sends a SIGHUP to Prometheus at least every hour. The data loss correlates exactly with the SIGHUP log statement and with prometheus_config_last_reload_success_timestamp.

[screenshot]
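
To reproduce the correlation, queries along these lines can be graphed side by side (a sketch, not part of the original report; the scraped counter name is a placeholder):

# the affected series
rate(some_scraped_counter_total[5m])
# timestamp of the last successful config reload
prometheus_config_last_reload_success_timestamp_seconds
# number of successful reloads per 1h window
changes(prometheus_config_last_reload_success_timestamp_seconds[1h])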

@fabxc tracked down the issue


fabxc commented May 22, 2017

Thanks for figuring out the correlation. However, this puzzles me a fair bit, as the storage is configured purely through flags and a reload does not touch it in any way. That only really leaves either the retrieval or the rule evaluation layer doing something different than 1.x.

Given that this is not a recording rule metric, it can only be retrieval. But I still cannot think of anything that changed that could cause this.


mwitkow commented May 23, 2017

[screenshot: simleaf_bad_timing]

The JSON responses for both the rate and the raw (un-rated) data, as seen in the graph, are included below.
dropped_hup_bad_data.tar.gz


fabxc commented May 23, 2017

Having a bit of trouble here. The unaggregated window in the image is a bit too narrow.
Does the unaggregated data also have artifacts or not? Which window does the raw data cover: around the artifact, or just up until it occurs?


mwitkow commented May 23, 2017

Sorry, I gave the wrong time range above. Here's a screenshot and data for another instance, with the graphs aligned on the bad rate. You can see a little dent in the raw data as well.

dropped_hup_bad_data2.tar.gz.tar.gz
[screenshot: hub_bad_timing2]


fabxc commented May 24, 2017

Thanks, this basically shows what one expects, but gives no hints as to what's going wrong.
The good news is that I can reproduce it locally to some degree. The bad news is that I see it equally for Prometheus 1 and 2.

Is there any chance that this just happens in phase with the Prometheus reloads in your 2.0 setup for some reason, and 1.x is lucky enough to avoid it?


mwitkow commented May 24, 2017

Is there any chance that this just happens in phase with the Prometheus reloads in your 2.0 setup for some reason, and 1.x is lucky enough to avoid it?

Not sure what you mean here. Both the P1 and P2 stacks use exactly the same SIGHUP mechanism, reloading exactly the same configs. I can't find a single hiccup in P1, and P2 has one on every single SIGHUP.

At the same time, neither P2 nor P1 reloads.


mwitkow commented May 30, 2017

@fabxc Is there anything else I can help with to reproduce this? We just shipped the latest dev-2.0 and will let it bake to see if it fixes the problem, but that seems unlikely.


fabxc commented May 30, 2017

The current dev-2.0 head does not have #2774 merged, so expect potential trouble. Not sure what else could help with reproducing this =/


fabxc commented Nov 12, 2017

@mwitkow @Bplotka do you still encounter this with 2.0?


bwplotka commented Nov 13, 2017

We will test the SIGHUP case with the new 2.0 soon and will let you know (:


AlekSi commented Dec 21, 2017

@mwitkow @Bplotka Any updates?


gouthamve commented Dec 21, 2017

Should be fixed by: #2830


bwplotka commented Dec 21, 2017

Hello, I spent some time today verifying this, and I can still reproduce it. I was testing against Prometheus 2.0 built from the release-2.0 branch (revision b3ddaf5ae9113a31be7f364c3dfee2e412e72df3).

Every time I send a SIGHUP we seem to drop some samples (scraping restarted?). Thanks to our new Thanos project I was able to spawn Prometheus in a 2-replica HA setup and reload the config on just the first one. The screenshot below shows a rate over a counter that has artifacts, but only for the replica that was signaled. Thanks to the Thanos global querier, we can see both replicas on the same graph:

[screenshot]

Red line: Prometheus replica 1, after ~9 SIGHUPs
Green line: Prometheus replica 2, no SIGHUPs
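
A per-replica comparison of this kind can be expressed roughly as follows (a sketch; the counter name, job selector, and the replica external label are placeholders, not taken from this setup):

# rate per replica, split on the external label each replica carries
sum by (replica) (rate(some_counter_total[5m]))
# scrape samples per hour: at a 15s interval each replica should have ~240,
# so a replica that drops scrapes around reloads comes up short
count_over_time(up{job="some-job"}[1h])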

Below are some graphs from the Prometheus UI on replica 1 (the one that was signaled a lot) and its prometheus_config_last_reload_success_timestamp_seconds metric:
[screenshots]

Aside from the Prometheus version, this setup is similar to what we use heavily with Prometheus 1.8, and there we did not see this issue.

So #2830 will certainly help for some use cases, but when you actually have config changes, we will still hit this. (:


bwplotka commented Dec 21, 2017

On the other hand, I think we are no longer blocked on this: with HA Prometheus (using Thanos), our roll-out procedure (especially in terms of configuration) can be simplified.


bwplotka commented Dec 21, 2017

@fabxc asked me to show the table view of grpc_client_handled_total{grpc_code="OK", grpc_method="Series"}[5m] OFFSET X when a SIGHUP occurs. The result:

15595957 @1513851364.047
15597461 @1513851379.047
15598993 @1513851394.047
15600499 @1513851409.047
15602000 @1513851424.047
15603511 @1513851439.047
15605036 @1513851454.047
15606539 @1513851469.047
15608057 @1513851484.047
15609580 @1513851499.047
15611093 @1513851514.047
15612625 @1513851529.047
15614131 @1513851544.047
15615631 @1513851559.047
15617145 @1513851574.047
15618647 @1513851589.047
15620155 @1513851604.047
15620792 @1513851611.467
15622307 @1513851626.468
15623804 @1513851641.468
15625272 @1513851656.468

We can see the switch (scrape loop restart?) between:

15620155 @1513851604.047
15620792 @1513851611.467

My scrape interval is 15s, but here something was scraped sooner: after only ~7.4s (1513851611.467 - 1513851604.047). The sub-second offset of the timestamps also changes from .047 to .467, which looks like a new scrape loop starting with a fresh alignment offset.
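
One way to surface such irregular spacing continuously (a sketch, not from this thread; the selector is a placeholder) is to compare the timestamp of the newest raw sample with that of the newest sample 15s earlier:

# flat 15 under a steady 15s scrape interval; deviations around a reload
# point at interrupted or restarted scrape loops
timestamp(up{job="some-job"}) - timestamp(up{job="some-job"} offset 15s)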


krasi-georgiev commented Dec 21, 2017

@Bplotka the discovery/scraping code has been completely refactored, so the scrape pool is never restarted any more; this landed just a few commits after the ones you tried.
Scrape loops are added for new targets and stopped for removed ones; loops for existing targets remain unchanged.
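
To confirm that existing scrape loops really keep their timing across a reload, the raw-sample check from above can be repeated against this build (a sketch; the job selector is a placeholder):

# console/table view across a SIGHUP; the @-timestamps should keep the same
# sub-second offset and steady 15s spacing before and after the reload
up{job="some-job"}[10m]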

Can you try this build (linux, amd64) and ping me on IRC if you find any other bugs?
https://github.com/krasi-georgiev/prometheus/releases/download/v2.0.0-beta.draft/prometheus
(this one includes the PR for another bug I found just before it was merged: #3605)


gouthamve commented Jan 19, 2018

Closed in: #3698

gouthamve closed this Jan 19, 2018
