Histogram incompletely pulled by federation request #1887
ncabatoffims referenced this issue on Aug 11, 2016: [feature] Add PromQL support for querying metrics at an absolute time #1888 (closed)
This is a race condition in the scraping, and part of why pulling all data from one Prometheus to another (or trying to emulate push with the pushgateway) is a bad idea, as artifacts like this are to be expected. It has nothing to do with histograms per se. Federation is intended for moving around aggregated stats; if you use it that way, this is far less likely to come up.
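For readers unfamiliar with that setup, a minimal sketch of a federation job on the downstream Prometheus that pulls only pre-aggregated series rather than all raw data. The target address and the job:* naming convention are illustrative assumptions, not taken from this report.

```yaml
# Downstream Prometheus (B): federate only aggregated series from A.
# The target address and the "job:*" match pattern are assumptions.
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # only recording-rule outputs, no raw per-instance series
    static_configs:
      - targets:
          - 'server-a:9191'        # hypothetical address for server A
```

As the comment above notes, federating only aggregated series makes artifacts like the one reported here far less likely.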
@brian-brazil To make sure I understand how one bucket got all timestamps transferred while another was missing two: is it that the series for those buckets got new samples in the middle of a federation scrape, so part of the histogram was already exported at the new timestamp and, from the view of the pulling Prometheus, some timestamps were skipped?
Yes, I believe that's what happened. There's a chance that mostly-atomic scrapes may be possible after #398, but that's in no way certain.
Yeah, no current way to fix this without blocking incoming samples during federation scrapes, which is not feasible. I guess we'll have to close this unfortunately and live with the occasional blips. Sorry!
juliusv closed this on Aug 11, 2016
If it were an occasional blip I wouldn't worry, but it's happening constantly at one site I'm looking at. I guess it's because the sample collection frequency is the same as the federation frequency and they're perfectly lined up to race. Perhaps I should experiment with choosing frequencies for each end that are less likely to align (like relatively prime numbers; see the sketch below).
Maybe https://prometheus.io/docs/operating/federation/ should be updated to document this?
That may be so, but from my point of view it's only particularly problematic in the histogram case. The service I'm monitoring is behaving wonderfully, serving all requests in <0.5s, but from the Prometheus metrics it looks like it's serving several 25s requests per minute. That makes it effectively useless for monitoring service times for us unless we can find a workaround, because it's full of false alarms.
Can you elaborate? I don't really see how aggregation would reduce the likelihood, unless the aggregation eliminates histograms.
I haven't looked at the code, so forgive my ignorance. Since I believe the pain this problem causes is mostly related to histograms, could we target a fix in that vein? For example, return to a federate request all of the samples from the metrics that make up a histogram, or none of them?
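A sketch of the interval experiment mentioned above, assuming A collects samples every 30s (which matches the 30s spacing of the timestamps later in this report); the 47s value is only an example of an interval that will not stay phase-locked with 30s.

```yaml
# On B: pick a federation interval that does not stay phase-locked
# with A's 30s collection interval (47s is illustrative).
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 47s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"ims_.*"}'   # illustrative selector
    static_configs:
      - targets:
          - 'server-a:9191'
```

This only makes the race line up less often; it does not remove it.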
juliusv referenced this issue on Aug 11, 2016: Document possible race conditions during federation prometheus/docs#514 (open)
It already talks about aggregation.
The idea is to push the calculation as far down the stack of Prometheus servers as you can (see the sketch below). Doing rates based on federated data is not a good idea, and I'd avoid using graphs based on federated data unless there was no other choice (e.g. global aggregations), preferring instead to go to the Prometheus that has the raw data.
This is not related to histograms; that's just where you noticed it. The data model Prometheus uses internally has no notion of the existence of histograms; they're just a convention.
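As an illustration of pushing the calculation down (written in today's YAML rule-file format, which postdates this thread): the Prometheus that scrapes the application computes the rate locally, and only the resulting job:* series is federated. The rule name and the 5m window are assumptions; the _count series is the usual companion of the histogram in this report.

```yaml
# Rules on the Prometheus that scrapes the application directly.
groups:
  - name: federation
    rules:
      # Per-job request rate, computed where the raw counter samples live.
      - record: job:ims_hessian_requests:rate5m
        expr: sum by (job) (rate(ims_hessian_request_duration_seconds_count[5m]))
```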
Good idea. I filed prometheus/docs#514.
You can run into similar problems with other metric types. For example, if the error counter is a scrape interval ahead of the total, it could get even larger than the total, giving you error ratios larger than 1.
At least if you pre-aggregate multiple metrics into a single one and then only federate that, there is no room for working with two divergent metrics (see the sketch below). The inconsistency you are seeing can only happen when you try to compute a result based on metrics from the same target, but from different scrapes of that same target, because then numbers which should always be (at least roughly) consistent with each other (histogram bucket counters, errors/totals, etc.) aren't.
As @brian-brazil said, the storage that spews out the federated metrics actually has no notion of what a histogram is anymore; it's just a bunch of time series that happen to follow the histogram naming convention (a _bucket suffix and le labels).
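A sketch of that pre-aggregation idea for the error-ratio example; the metric names here are hypothetical.

```yaml
# Computed on the Prometheus that scrapes the target, so numerator and
# denominator come from the same scrapes and stay consistent.
groups:
  - name: ratios
    rules:
      - record: job:http_requests:error_ratio5m
        expr: |
          sum by (job) (rate(http_requests_errors_total[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```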
OK, I guess we'll have to push the calculations down the stack as you say. Thanks for taking the time to explain the details. It's a shame because it makes things considerably less flexible for us: dashboard authors won't be able to, e.g., implement a different percentile option on the fly; they'll have to make a change request and wait for it to be deployed. Sadly, firewall and upstream provisioning issues make it impractical for us to eliminate the federation altogether, though I may try to eliminate one of the levels of federation based on what I've learned today.
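One possible middle ground, not discussed above but worth noting as a hedged sketch: aggregate the bucket rates on the lower-level Prometheus while keeping the le label, and federate those. Dashboards on the upper level can then still choose any percentile with histogram_quantile, at the cost of a fixed rate window. The rule name and window are assumptions.

```yaml
# On the Prometheus with the raw data: aggregate bucket rates but keep "le",
# so any quantile can still be derived from the federated series.
groups:
  - name: histogram_aggregation
    rules:
      - record: job:ims_hessian_request_duration_seconds_bucket:rate5m
        expr: sum by (job, le) (rate(ims_hessian_request_duration_seconds_bucket[5m]))
```

A dashboard on B could then run, for example, histogram_quantile(0.9, job:ims_hessian_request_duration_seconds_bucket:rate5m). Federating rule outputs can still be inconsistent with rule evaluation timing (see #1893 referenced below), so this is a mitigation, not a guarantee.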
Perhaps this is solvable. I'll add my idea to prometheus/docs#514. If it is feasible, we can open a new issue here in the Prometheus repo.
beorn7 referenced this issue on Aug 15, 2016: Make rule evaluation and federation consistent in time #1893 (open)
One way to mitigate this might be to have histogram_quantile use the value from the previous bucket if the value in the current bucket is decreasing.
lmb referenced this issue on Apr 7, 2017: Aggregated histogram_quantile produces incorrect, unstable results #2598 (closed)
jjneely referenced this issue on Apr 11, 2017: Force buckets in a histogram to be monotonic for quantile estimation #2610 (merged)
lock bot commented on Mar 24, 2019: This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
ncabatoffims commented on Aug 11, 2016
What did you do?
Set up two Prometheus instances, A and B, where B federates metrics from A. Have an exporter of a histogram metric be scraped by A. Graph that metric on B using histogram_quantile (an example query is sketched below).
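The exact graph query is not given here; a typical query of this kind would look something like the following (the 0.99 quantile and the 5m window are assumptions, not taken from the original text).

```
histogram_quantile(0.99, rate(ims_hessian_request_duration_seconds_bucket[5m]))
```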
What did you expect to see?
A valid representation of the application behaviour being tracked by the metric.
What did you see instead? Under which circumstances?
Some points on the graph were very high: most points were <0.5s, but a handful were at 25s (the biggest non-+Inf bucket). When I ran the same query on A, those 25s points weren't present. Digging in on the consoles for A and B, I found that at the time of one of the 25s blips, asking for a range vector on the bucket series yielded a different number of samples for the +Inf bucket than for all the others.
Server A:
ims_hessian_request_duration_seconds_bucket{le="25"}
102986 @ 1470941522.591
102991 @ 1470941552.591
102994 @ 1470941582.591
102998 @ 1470941612.591
ims_hessian_request_duration_seconds_bucket{le="+Inf"}
102986 @ 1470941522.591
102991 @ 1470941552.591
102994 @ 1470941582.591
102998 @ 1470941612.591
Server B:
ims_hessian_request_duration_seconds_bucket{le="25"}
102986 @ 1470941522.591
102991 @ 1470941552.591
102994 @ 1470941582.591
102998 @ 1470941612.591
ims_hessian_request_duration_seconds_bucket{le="+Inf"}
102991 @ 1470941552.591
102998 @ 1470941612.591
Environment
Server A: Linux 3.10.0-229.4.2.el7.x86_64 x86_64
Server B: Linux 4.6.4-301.fc24.x86_64 x86_64
A and B:
I was simplifying the situation a little bit in my description above. In fact, there are two levels of federation: server A is itself pulling metrics from other Prometheus instances, one of which was the one that collected the histogram from a Java process. Ultimately my concern is that the histogram was incomplete: I would expect that if a Prometheus instance contains a histogram sample at time T, another Prometheus instance doing a federation request would get the entire histogram sample set for time T, rather than just some of the buckets.
Server A (Prometheus runs on port 9191):
Server B (Prometheus runs on port 9292):
I found nothing other than the usual Checkpoint logs.