Histogram incompletely pulled by federation request #1887

Closed
ncabatoffims opened this Issue Aug 11, 2016 · 11 comments

ncabatoffims commented Aug 11, 2016

What did you do?

Set up two Prometheus instances, A and B, where B federates metrics from A. Have an exporter of a histogram metric be scraped by A. Graph that metric on B using histogram_quantile.
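
(For illustration, such a graph uses a query roughly of this form; the 0.99 quantile and the 5m rate window are placeholders, not necessarily the exact values used here:)

histogram_quantile(0.99, sum(rate(ims_hessian_request_duration_seconds_bucket[5m])) by (le))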

What did you expect to see?

A valid representation of the application behaviour being tracked by the metric.

What did you see instead? Under which circumstances?

Some points on the graph were very high: most points were <0.5s, but a handful were at 25s (the largest non-+Inf bucket). When I ran the same query on A, those 25s points weren't present. Digging into the consoles for A and B, I found that at the time of one of the 25s blips, asking for a range vector on the bucket yielded a different number of samples for the +Inf bucket than for all the others.
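
(A range-vector query along these lines shows the mismatch; the 2m window is illustrative, just wide enough to cover a few 30s scrapes:)

ims_hessian_request_duration_seconds_bucket{le="25"}[2m]
ims_hessian_request_duration_seconds_bucket{le="+Inf"}[2m]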

Server A:
ims_hessian_request_duration_seconds_bucket{le="25"}
102986 @ 1470941522.591
102991 @ 1470941552.591
102994 @ 1470941582.591
102998 @ 1470941612.591
ims_hessian_request_duration_seconds_bucket{le="+Inf"}
102986 @ 1470941522.591
102991 @ 1470941552.591
102994 @ 1470941582.591
102998 @ 1470941612.591

Server B:
ims_hessian_request_duration_seconds_bucket{le="25"}
102986 @ 1470941522.591
102991 @ 1470941552.591
102994 @ 1470941582.591
102998 @ 1470941612.591
ims_hessian_request_duration_seconds_bucket{le="+Inf"}
102991 @ 1470941552.591
102998 @ 1470941612.591

Environment

  • System information:

Server A: Linux 3.10.0-229.4.2.el7.x86_64 x86_64
Server B: Linux 4.6.4-301.fc24.x86_64 x86_64

  • Prometheus version:

A and B:

prometheus, version 1.0.1 (branch: master, revision: 5bbd31a)
  build user:       mockbuild@darkstar
  build date:       20160722-13:51:56
  go version:       go1.6.2

  • Prometheus configuration file:

I was simplifying the situation a little in my description above. In fact, there are two levels of federation: server A is itself pulling metrics from other Prometheus instances, one of which was the one that collected the histogram from a Java process. Ultimately my concern is that the histogram was incomplete: I would expect that if a Prometheus instance contains a histogram sample at time T, a federation request from another Prometheus instance would return the entire histogram sample set for time T, rather than just some of the buckets.

Server A (Prometheus runs on port 9191):

global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9191']

  - job_name: 'federate'
    honor_labels: false
    metrics_path: '/federate'

    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"(jvm|process).+"}'
        - 'ims_hessian_request_duration_seconds_bucket'
        - 'ims_hessian_request_duration_seconds_count'
        - 'ims_hessian_request_duration_seconds_sum'
        # etc...

    static_configs:
      - targets:
        - 'server1:9090'
        # etc...

Server B (Prometheus runs on port 9292):

global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
        - targets: ['localhost:9292']

  - job_name: 'federate'
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - '{__name__=~".+"}'
    static_configs:
        - targets: ['serverA:9191', ]
          labels:
            'client': 'A'

  • Logs:

I found nothing other than the usual Checkpoint logs.

brian-brazil (Member) commented Aug 11, 2016

This is a race condition in the scraping, and part of why pulling all data from one Prometheus to another (or trying to emulate push with the pushgateway) is a bad idea as artifacts like this are to be expected. It's nothing to do with histograms per se.

Federation is intended for moving around aggregated stats; if you use it that way, this is far less likely to come up.

juliusv (Member) commented Aug 11, 2016

@brian-brazil To check my understanding of how one bucket could have all timestamps transferred while another was missing two: is it that the series for those buckets got new samples in the middle of a federation scrape, and thus part of the histogram was already exported at the new timestamp, so from the view of the pulling Prometheus some timestamps were skipped?

brian-brazil (Member) commented Aug 11, 2016

Yes, I believe that's what happened. There's a chance that mostly-atomic scrapes may be possible after #398, but that's in no way certain.

juliusv (Member) commented Aug 11, 2016

Yeah, no current way to fix this without blocking incoming samples during federation scrapes, which is not feasible. I guess we'll have to close this unfortunately and live with the occasional blips. Sorry!

juliusv closed this Aug 11, 2016

ncabatoffims (Author) commented Aug 11, 2016

If it were an occasional blip I wouldn't worry, but it's happening constantly at one site I'm looking at. I guess it's because the sample collection frequency is the same as the federation frequency and they're perfectly lined up to race. Perhaps I should experiment with choosing frequencies for each end that are less likely to align (like relatively prime numbers).
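
(For illustration, de-aligning the two ends could look roughly like this on B; the 37s value is just an example of an interval that is relatively prime to A's 30s collection interval:)

scrape_configs:
  - job_name: 'federate'
    # Illustrative value: 37s is relatively prime to A's 30s interval,
    # so the two scrape schedules rarely line up.
    scrape_interval: 37s
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - '{__name__=~".+"}'
    static_configs:
        - targets: ['serverA:9191', ]
          labels:
            'client': 'A'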

This is a race condition in the scraping, and part of why pulling all data from one Prometheus to another (or trying to emulate push with the pushgateway) is a bad idea as artifacts like this are to be expected.

Maybe https://prometheus.io/docs/operating/federation/ should be updated to document this?

It's nothing to do with histograms per se.

That may be so, but from my point of view it's only particularly problematic in the case of histograms. The service I'm monitoring is behaving wonderfully, serving all requests in <0.5s, but from looking at the Prometheus metrics it looks like it's serving several 25s requests per minute. This makes the federated data effectively useless for monitoring service times for us unless we can find a workaround, because it's full of false alarms.

Federation is intended for moving around aggregated stats; if you use it that way, this is far less likely to come up.

Can you elaborate? I don't really see how aggregation would reduce the likelihood, unless the aggregation eliminates histograms.

Yeah, no current way to fix this without blocking incoming samples during federation scrapes, which is not feasible.

I haven't looked at the code so forgive my ignorance. Since I believe the pain this problem causes is mostly related to histograms, could we target a fix in that vein? For example, return to a federation request either all of the samples from the metrics that make up a histogram, or none of them?

brian-brazil (Member) commented Aug 11, 2016

Maybe https://prometheus.io/docs/operating/federation/ should be updated to document this?

It already talks about aggregation.

Can you elaborate? I don't really see how aggregation would reduce the likelihood, unless the aggregation eliminates histograms.

The idea is to push the calculation as far down the stack of Prometheus servers as you can. Doing rates based on federated data is not a good idea, and I'd avoid using graphs based on federated data unless there is no other choice (e.g. global aggregations), preferring instead to go to the Prometheus that has the raw data.
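
(As a sketch of what pushing the calculation down can look like: record the quantile on the Prometheus that scrapes the raw data, then federate the recorded series instead of the raw buckets. The rule name, the 0.99 quantile and the 5m window below are illustrative; the syntax shown is the 1.x rule-file format:)

job:ims_hessian_request_duration_seconds:p99_5m = histogram_quantile(0.99, sum(rate(ims_hessian_request_duration_seconds_bucket[5m])) by (job, le))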

I haven't looked at the code so forgive my ignorance. Since I believe the pain this problem causes is mostly related to histograms, could we target a fix in that vein?

This is not related to histograms; that's just where you noticed it.

For example, return to a federation request either all of the samples from the metrics that make up a histogram, or none of them?

The data model Prometheus uses internally has no notion of the existence of histograms; they're just a convention.

juliusv (Member) commented Aug 11, 2016

This is a race condition in the scraping, and part of why pulling all data from one Prometheus to another (or trying to emulate push with the pushgateway) is a bad idea as artifacts like this are to be expected.

Maybe https://prometheus.io/docs/operating/federation/ should be updated to document this?

Good idea. I filed prometheus/docs#514.

It's nothing to do with histograms per se.

That may be so, but from my point of view it's only particularly problematic in the case of histograms.

You can run into similar problems with other metric types. For example:

irate(errors[1m]) / irate(total[1m])

If the error counter is a scrape interval ahead of the total, it could even become larger than the total, giving you error ratios larger than 1. In any case, the ratio will be wrong.

Federation is intended for moving around aggregated stats; if you use it that way, this is far less likely to come up.

Can you elaborate? I don't really see how aggregation would reduce the likelihood, unless the aggregation eliminates histograms.

At least if you pre-aggregate multiple metrics into a single one and then only federate that, there is no room for ending up with two divergent metrics to work with. The inconsistency you are seeing can only happen when you compute a result from metrics that come from the same target but from different scrapes of that target; then, numbers which should always be (at least roughly) consistent with each other (histogram bucket counters, errors/totals, etc.) aren't.
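
(For illustration: assuming pre-aggregated series are recorded under a naming convention such as a "job:" prefix, which is a common convention rather than anything specified in this thread, B's federation job could match only those instead of every raw series:)

scrape_configs:
  - job_name: 'federate'
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        # Only pull recorded/aggregated series (hypothetical "job:" naming convention)
        - '{__name__=~"job:.*"}'
    static_configs:
        - targets: ['serverA:9191', ]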

For example, return to a federation request either all of the samples from the metrics that make up a histogram, or none of them?

As @brian-brazil said, the storage that spews out the federated metrics actually has no notion of what a histogram is anymore - it's just a bunch of time series that happen to have the le label (which histogram_quantile() interprets in a certain way).

ncabatoffims (Author) commented Aug 11, 2016

Ok, I guess we'll have to push the calculations down the stack as you say. Thanks for taking the time to explain the details.

It's a shame because it makes things considerably less flexible for us - dashboard authors won't be able to e.g. implement a different percentile option on the fly; they'll have to make a change request and wait for it to be deployed. Sadly, firewall and upstream provisioning issues make it impractical for us to eliminate federation altogether, though I may try to eliminate one of the levels of federation based on what I've learned today.

beorn7 (Member) commented Aug 15, 2016

Perhaps this is solvable. I'll add my idea to prometheus/docs#514. If it is feasible, we can open a new issue here in the Prometheus repo.

brian-brazil (Member) commented Aug 28, 2016

One way to mitigate this might be to have histogram_quantile use the value from the previous bucket if the value in the current bucket is lower.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
