GC drops all the series from the head-block every 2 hours #4115
Comments
This looks normal to me. The in-memory head is used to avoid frequent writes to disk.
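If you want to see the head cycling on a graph, the server's own TSDB metrics show it directly (metric names as exposed by Prometheus 2.x):

```promql
prometheus_tsdb_head_series     # series currently held in the in-memory head
prometheus_tsdb_head_min_time   # oldest timestamp still covered by the head
prometheus_tsdb_head_max_time   # newest timestamp in the head
```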
Regarding missing metrics, you might be hitting this, but I doubt it.
This does not look like a sane setup, and the logs confirm you're doing something odd.
@brian-brazil it's a higher-level Prometheus scraping another one at a lower frequency.
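For reference, that kind of setup usually looks roughly like this on the higher-level server (all names, selectors, and intervals here are placeholders, not the actual config from this issue):

```yaml
scrape_configs:
  - job_name: 'federate'                 # placeholder job name
    honor_labels: true                   # keep the original job/instance labels
    metrics_path: '/federate'
    scrape_interval: 1h                  # the low frequency described above (placeholder)
    params:
      'match[]':
        - '{job=~".+"}'                  # placeholder selector
    static_configs:
      - targets: ['lower-level-prometheus:9090']   # placeholder target
```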
Are you actually missing any metrics when you run a query? The head gets persisted to disk when it is cleared, so the metrics aren't lost.
The maximum sane interval is 2m, and the logs indicate metrics are clashing. I'd presume that this behaviour is due to your setup, and not an issue with Prometheus.
It took me a while to realise the real issue @semyonslepov. Just had a quick look at the code for the scraping and it seems that if the target in the
Got a real data-loss issue with a more sane configuration, and it again correlates with the GC execution. Configuration:
Log messages on the federating instance:
That's a failed scrape, which is not data loss. See also https://www.robustperception.io/federation-what-is-it-good-for/
@brian-brazil how did you figure out that it is a missed scrape?
up is 0.
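i.e. a query like the following against the federating instance shows exactly when the scrape failed (the job name is a placeholder):

```promql
up{job="federate"}                          # 0 for the duration of a failed scrape
1 - avg_over_time(up{job="federate"}[1h])   # rough fraction of failed scrapes in the last hour
```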
Thanks, I didn't notice the screenshot until now. @semyonslepov this looks different from the original issue with the ingestion logs. Did you check if the target exposes the correct timestamps?
@krasi-georgiev they do
I am still a bit curious why this happens (see prometheus/vendor/github.com/prometheus/tsdb/head.go, lines 481 to 483 in 5b27996). I just tried the federation and it returns the most recent ingested sample + timestamp, so even if you scrape the main Prometheus server every 2h (like in your original config) you should still be getting a recent timestamp, which should be within the current head block's time range. The only explanation would be if the main server that you are federating returns a timestamp that is too old, so it cannot be ingested in the current head block.
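One way to sanity-check that, assuming the lower-level server is reachable, is to look at the timestamps the /federate endpoint actually exposes (host and selector are placeholders):

```sh
# -g disables curl's URL globbing so the [] in match[] is passed through as-is
curl -sg 'http://main-prometheus:9090/federate?match[]=up' | head
# Each exposed line ends with a millisecond timestamp; if those lag far behind
# "now", the samples can fall outside the receiving server's head time range
# and be rejected as out-of-bounds.
```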
I would imagine that it's an issue for the first configuration with the 1h scrape interval. However, it still happens with 30s scrapes. Moreover, in this case it doesn't happen randomly; it happens at the same time as the GC runs.
I think dropping at GC is normal, as this is when the code verifies that the metrics have the correct timestamp sequence and drops the ones that are not within the head time range.
@semyonslepov is it behaving as expected after you changed the configs?
@krasi-georgiev I didn't change the configuration; there are two different configurations on two different hosts. I just wanted to emphasize that they both have the same issue.
I am running out of ideas on this one.
@krasi-georgiev no news, scrapes still fail, leaving gaps afterwards (it happens together with
Hm, any idea how we can troubleshoot this together? With the current details it is a bit of guessing.
@krasi-georgiev I will try to increase the scrape timeouts and see how it works; probably it's all about a too-heavy federation target.
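Something along these lines on the federating instance (values and the job name are illustrative, not the actual config from this issue):

```yaml
scrape_configs:
  - job_name: 'federate'        # placeholder name for the existing federation job
    scrape_interval: 30s
    scrape_timeout: 25s         # raised from the 10s default; must not exceed scrape_interval
```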
brian-brazil added the kind/more-info-needed label on Jun 13, 2018
Timeout increase didn't help (and
And an empty result when trying to fetch the test metric's value (we have a monitoring script for it; it asks the Prometheus API for known metrics every 5 minutes).
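For what it's worth, the check amounts to an instant query against the HTTP API; the metric name below is a placeholder for the real test metric:

```sh
curl -s 'http://localhost:9090/api/v1/query?query=some_test_metric'
# A response with "result":[] means there is no current sample for the metric,
# i.e. it has gone stale or was never ingested.
```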
Maybe if you add some debugging info it will give you a clue about which samples are dropped, but the output would be quite busy so I'm not sure it will help. Add it here: prometheus/vendor/github.com/prometheus/tsdb/head.go, lines 480 to 483 in 5b27996, then compile a binary with the change.
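Roughly along these lines, i.e. logging each rejected sample right before the error is returned. This is a standalone sketch with made-up names, not a patch against the actual file, so adapt it to the code at the referenced lines before rebuilding:

```go
// Package sketch illustrates the kind of debug logging suggested above:
// log every sample that gets rejected for being older than the head's
// minimum valid time. Names are illustrative, not the actual tsdb code.
package sketch

import "log"

// appender stands in for the tsdb head appender at the referenced lines.
type appender struct {
	minValidTime int64 // earliest timestamp the head still accepts (ms)
}

// addWithDebug mirrors the bounds check and logs each rejected sample so the
// dropped metric/timestamp pairs become visible in the server's output.
func (a *appender) addWithDebug(metric string, t int64, v float64) bool {
	if t < a.minValidTime {
		log.Printf("dropping out-of-bounds sample: metric=%s t=%d v=%g minValidTime=%d",
			metric, t, v, a.minValidTime)
		return false
	}
	// ...normal ingestion would continue here...
	return true
}
```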
Seems stale, feel free to reopen if you think we should revisit.


semyonslepov commented Apr 26, 2018
Bug Report
What did you do?
Normal Prometheus operation with ~1000000 series in the head block
What did you expect to see?
Smooth operation with all the data available ~99.99% of the time.
Some percentage of time series being dropped from the head block on every GC execution, but not all of them.
What did you see instead? Under which circumstances?
It seems that every 2 hours all the series are dropped from the head block and then restored in ~10 minutes.
This event coincides with the GC execution.
Time-series data occasionally becomes unavailable during this "drop-restore" period.
Relevant metrics:
And it happens every 2 hours:
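For anyone trying to reproduce this, the drop/GC correlation can be graphed from Prometheus's own TSDB metrics; queries along these lines (metric names as exposed by Prometheus 2.x; the GC summary metric may differ between versions) show the pattern:

```promql
prometheus_tsdb_head_series                                    # total series held in the head
rate(prometheus_tsdb_head_series_removed_total[5m])            # series removed at each head GC
rate(prometheus_tsdb_head_gc_duration_seconds_count[5m]) > 0   # moments when a head GC ran
```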
Environment
Linux 4.9.91-40.57.amzn1.x86_64 x86_64