Prom2: SIGHUP for config reload causes artifacts in TSDB storage #2756
Comments
Thanks for figuring out the correlation. However, this puzzles me a fair bit, as the storage is purely configured through flags and a reload does not touch it in any way. That only really leaves either the retrieval or the rule evaluation layer doing something different than 1.x. Given that this is not a recording rule metric, it can only be retrieval. But I still cannot think of anything that changed that could cause this.
Maybe (wild-ass guess) there's a subtle interplay between the timestamp at which it is stored and the delay in retrieval confusing rate()?
Interestingly, a side-by-side P1 is happy during these, so it's definitely a change in dev-2.0.
Can you provide a dump of the JSON response, once for the raw metric and once with rate() applied? Ideally at the coarsest interval that still shows the problem, with an annotation of the time range that shows it in the graph.
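For illustration, a minimal sketch of how such dumps could be pulled from the HTTP API (the address, metric name, and time window below are placeholders, not taken from this thread):

```
# Raw samples over the window that shows the gap (-G keeps this a GET,
# with the parameters URL-encoded into the query string).
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=some_counter_total' \
  --data-urlencode 'start=2017-05-22T19:00:00Z' \
  --data-urlencode 'end=2017-05-22T21:00:00Z' \
  --data-urlencode 'step=15s' > raw.json

# The same window with rate() applied, for comparison.
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(some_counter_total[5m])' \
  --data-urlencode 'start=2017-05-22T19:00:00Z' \
  --data-urlencode 'end=2017-05-22T21:00:00Z' \
  --data-urlencode 'step=15s' > rate.json
```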
The JSON responses for both the
Having a bit of trouble here. The unaggregated window in the image is a bit too narrow.
Sorry, I gave the wrong time range above. Here's a screenshot and data for another instance with the graphs aligned to the bad
Thanks, that basically shows what one expects, but gives no hints as to what's going wrong. Is there any chance that this just happens in phase with Prometheus reloads in your 2.0 setup for some reason, and 1.x is lucky enough to avoid it?
Not sure what you mean here. Both the P1 and P2 stacks use exactly the same SIGHUP mechanism, reloading exactly the same configs. I can't find a single P1 hiccup, and P2 has it on every single SIGHUP. At the same time, neither P2 nor P1 reload.
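For context, the reload mechanism being compared here is just a plain signal; a hypothetical sketch of a config-sync hook that treats both stacks identically (the pid file paths are assumptions, not taken from this thread):

```
# After writing the new prometheus.yml, send the exact same SIGHUP to
# the 1.x and the dev-2.0 server (hypothetical pid file locations).
kill -HUP "$(cat /var/run/prometheus-1x.pid)"
kill -HUP "$(cat /var/run/prometheus-2x.pid)"
```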
brian-brazil added the dev-2.0 label on May 25, 2017
@fabxc Is there anything else I can help with to repro? We just shipped the latest
The current dev-2.0 head does not have #2774 merged, so expect potential trouble. Not sure what else could help reproduce this =/
brian-brazil referenced this issue on Jul 14, 2017: Handle in-progress scrapes more gracefully on config reload #2336 (closed)
brian-brazil added the kind/enhancement, component/config, and priority/P2 labels on Jul 14, 2017
@mwitkow @Bplotka do you still encounter this with 2.0?
We will test the new 2.0's SIGHUP case soon and will let you know (:
@mwitkow @Bplotka Any updates?
Should be fixed by: #2830
Hello, I spent some time today to verify it and I can still reproduce this. I was testing against Prometheus 2.0 built from

Every time I send a SIGHUP signal we seem to drop some samples (scraping restarted?). Thanks to our new Thanos project I was able to spawn Prometheus in a 2-replica HA setup and reload the config on just the first one. The screenshot below shows a rate over some counter that has artifacts, but only for the one replica that was signaled. Thanks to Thanos' global querier, we can see it on the same graph:

Red line: Prometheus replica 1, after some ~9 SIGHUPs

Below are some graphs from the Prometheus UI on replica 1 (the one that was signaled a lot) and its

Apart from the Prometheus version, this setup is similar to what we are using heavily with Prometheus 1.8, and we did not see that issue there. So #2830 will certainly help for some use cases, but when you do have config changes, we will still hit that. (:
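That comparison boils down to a query of roughly this shape against the Thanos querier, which keeps the two replicas apart by their replica label (the address, metric name, and label values are assumptions, not taken from this thread):

```
# rate() over the affected counter, evaluated through the Thanos querier
# so both HA replicas show up as separate series on the same graph.
curl -sG 'http://thanos-query.example:10902/api/v1/query' \
  --data-urlencode 'query=rate(some_counter_total{replica=~"prometheus-[12]"}[5m])'
```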
On the other hand, I think we are no longer blocked on this: having HA Prometheus (using Thanos), our rollout procedure (especially in terms of configuration) can be simplified.
@fabxc asked me to show the table view of
We can see the switch (scrape loop restart?) between:
My scrape interval is 15s, but here we scraped something sooner: after only
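One way to reproduce that table view outside the UI is to dump the raw samples over a short range; two consecutive samples less than the 15s scrape interval apart point at a restarted scrape loop (the metric name and range below are placeholders):

```
# An instant query with a range selector returns the raw samples and
# their timestamps; look for neighbours closer than 15s apart.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=some_counter_total[2m]'
```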
@Bplotka the discovery/scraping is completely refactored, so now the scrape pool is never restarted. Just a few commits after the ones you tried. Can you try this build (linux, amd64) and ping me on IRC if you find any other bugs.
krasi-georgiev referenced this issue on Dec 22, 2017: reload a scrape pool when detected a config change #3610 (closed)
This was referenced on Jan 10, 2018
Closed in: #3698
gouthamve closed this on Jan 19, 2018
lock bot commented on Mar 23, 2019: This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
mwitkow commented on May 22, 2017
What did you do?
Run P2 from the dev-2.0 branch side by side with our P1 stack, with the same config reloading and scraped targets.
What did you expect to see?
A stable retention, similar to P1.
What did you see instead? Under which circumstances?
A periodic (every 1h) weird hole in our data, which shows up massively in rates.
This coincides with our config syncing job that sends a SIGHUP to Prometheus at least every hour. The data loss correlates exactly with the SIGHUP log statement and with prometheus_config_last_reload_success_timestamp.
@fabxc tracked down the issue
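A rough sketch of the correlation check described in the report, assuming a local server; the metric name used below is the _seconds-suffixed form exposed by released 2.x versions, and the counter name and time window are placeholders:

```
# When did the last successful config reload happen, according to
# Prometheus itself?
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_config_last_reload_success_timestamp_seconds'

# Compare that timestamp against where the rate of an affected counter
# dips (placeholder counter name and time window).
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(some_counter_total[5m])' \
  --data-urlencode 'start=2017-05-22T12:00:00Z' \
  --data-urlencode 'end=2017-05-22T18:00:00Z' \
  --data-urlencode 'step=60s'
```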