
Implement federation (timeseries streaming) #9

Closed
juliusv opened this Issue Jan 4, 2013 · 13 comments

juliusv (Member) commented Jan 4, 2013

It should be possible to efficiently stream time series from one Prometheus instance to another, with the exchanged series determined by a federation configuration.
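For illustration, the pulling side of such a federation configuration could look roughly like the sketch below: a dedicated scrape job on the higher-level server that selects series from a lower-level one. The /federate path, match[] selectors, target name, and interval are assumptions for the example (they mirror what the feature eventually became, not anything specified in this issue):

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true              # keep the job/instance labels assigned by the source server
    metrics_path: '/federate'       # assumed dedicated endpoint on the lower-level Prometheus
    params:
      'match[]':                    # selectors deciding which series are exchanged
        - '{job="node"}'            # illustrative: all series from the "node" job
        - '{__name__=~"job:.*"}'    # illustrative: pre-aggregated recording-rule results
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'source-prometheus:9090'   # hypothetical lower-level Prometheus instance
```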

brian-brazil (Member) commented Dec 10, 2014

One idea would be to do this via console templates: we could add a function that takes the output of a query and produces the text/protobuf format. We'd need some hook to set the content type, too.

juliusv (Member, Author) commented Dec 10, 2014

@brian-brazil This would certainly be possible, but since this is an integral feature, it arguably deserves its own specialized and optimized implementation and endpoint, no?

brian-brazil (Member) commented Dec 11, 2014

A separate endpoint would be best.

brian-brazil (Member) commented Feb 18, 2015

A way to do this via console templates, until we've got a full-on solution: https://github.com/prometheus/prometheus/blob/master/consoles/federation_template_example.txt
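For readers who don't open the link, the shape of that workaround is roughly the following: a console template that runs a query and re-emits the result in the text exposition format, which a higher-level Prometheus can then scrape. This is only a sketch, not the contents of the linked file; the `query` function and the `.Labels`/`.Value` fields come from the console-template API, while the expression and label set are illustrative:

```
{{/* Sketch: re-expose the result of an expression in text exposition format. */}}
{{ range query "up" }}up{job="{{ .Labels.job }}",instance="{{ .Labels.instance }}"} {{ .Value }}
{{ end }}
```

As noted above, the remaining gap is a hook to serve this with a text-format content type instead of text/html.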

beorn7 (Member) commented Apr 15, 2015

The solution might include "streaming" as in "transfer more than one timestamped sample per time series during one scrape of a lower-level Prometheus server by a higher-level one".
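Concretely, the text exposition format already allows an optional timestamp (milliseconds since epoch) after each value, so "streaming" in this sense would mean emitting several timestamped lines for the same series in one scrape, which the format as currently specified doesn't support (it expects at most one sample per unique series per exposition). The metric, labels, and values below are made up:

```
# Hypothetical multi-sample exposition for a single series: value, then timestamp in ms.
http_requests_total{job="api",instance="a:9090"} 1027 1429000000000
http_requests_total{job="api",instance="a:9090"} 1043 1429000015000
http_requests_total{job="api",instance="a:9090"} 1058 1429000030000
```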

brian-brazil (Member) commented Apr 15, 2015

I'm a little wary of doing more than one value. The main reason you'd need that would be if a previous scrape failed, and requesting more data from a server that failed last time may lead to a cascading failure.

multilinear commented May 27, 2015

There are two common use cases for federation:

  1. Scaling, as folks have mentioned. Given Prometheus's scaling characteristics, this is probably the rarer use case.
  2. Aggregating data across zones of some form.

It's generally important to monitor a target from "nearby": you want to run Prometheus as close to the target, in the network sense, as possible. It's generally a good idea to run it in the same failure domain as well, since then your monitoring goes down exactly when your system goes down instead of alternating with it. This helps avoid your system being up while your monitoring is down, minimizes the impact of netsplits on monitoring, etc.

In the case of multiple zones, though, it's often useful to cross-correlate data across those zones, so you'd use federation to pull the data into a "global"-level Prometheus. In this case it'd be fairly common for a scrape to fail due to a network-level event (fiber cut, router failure, etc.), and it kind of sucks to just lose that data from your global-level Prometheus instance when it still exists in the lower-level monitoring.

I should note here that in the Prometheus model there isn't a global store to pull from, so if the data isn't in that top-level instance right now, you'll never get it there. You'd end up having to do periodic dumps and imports from your lower-level promethei to fill in holes from network outages... ick :(.

I'd suggest pulling data in a more "streaming" fashion with a bounded window. The default bound can be relatively small to avoid the cascading problem; that way it should at least be able to bridge small network "glitches" like those frequently seen on intercontinental links. If someone wants to expose themselves to cascading failures to handle a cruddy network, they could extend the window.

multilinear commented May 27, 2015

Oh, also, this way you can handle high-frequency data without having a high-frequency poll at the federation layer.

brian-brazil (Member) commented May 27, 2015

I don't think a bounded window is sufficient to prevent cascading failures. Even if it requests at most two data points, that means the load on the slave Prometheus server could double in an outage, which would be bad.

My experience is that gaps from small network blips don't usually cause problems in practice. I'd try to avoid putting anything critical in a global Prometheus, due to the fundamental unreliability of the WAN (and data appearing a bit back in time may cause weirdness with rules); it's more for general information, with the per-cluster/failure-domain Prometheus servers being the place you usually go first.

multilinear commented May 27, 2015

What about higher-frequency data? It seems the scrapes will have to happen at least as fast as the fastest scrape that the lower-level Prometheus is doing, which, assuming Prometheus is as well written as I think it is (I'm new to the community), could be very, very fast.

brian-brazil (Member) commented May 27, 2015

At the global level, high frequency data is much less useful than at a local level.

High-frequency data (on the order of seconds) is primarily useful for debugging things like microbursts, for which you usually want to look at a handful of variables in roughly one datacenter at a time to figure things out, and to reduce the impact of the various race conditions inherent in monitoring.

At a global level you tend to want a wide range of metrics at no more than one-minute granularity. A well-instrumented server will tend to have hundreds to thousands of metrics and many thousands of time series. Doing scrapes more often will make you run into performance problems sooner without much benefit from the increased frequency; rather, it's the breadth of instrumentation that helps you pin down all bar the microburst-level issues. If anything, you'd be looking at downsampling a bit at the global level.
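One way to get that breadth at the global level without shipping every raw series is to federate only pre-aggregated data: the lower-level server evaluates recording rules and the global server pulls just the resulting series. A minimal sketch in today's YAML rules syntax; the group, rule, and metric names are illustrative, not something prescribed in this thread:

```yaml
groups:
  - name: federation_aggregates
    rules:
      # Per-job request rate, cheap to pull into a global Prometheus.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Per-job error ratio over the same window.
      - record: job:http_request_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```

The global server would then restrict its pull to something like a `{__name__=~"job:.*"}` selector, keeping the breadth of instrumentation while dropping the per-instance detail.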

brian-brazil (Member) commented Aug 20, 2015


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
