
Add support for local storage engine that only supports federation queries #2174

Closed
ncabatoff opened this Issue Nov 8, 2016 · 17 comments

ncabatoff commented Nov 8, 2016

The goal of this enhancement proposal is to reduce the resource usage for Prometheus instances that live only to collect metrics and answer federation queries. The use case is to work around networking issues by using this Prometheus instance as something like a "pull gateway", collecting metrics on the local host and allowing a downstream Prometheus to federate them over a single port, without having to expose all the ports of the underlying metric sources.

The proposal is to add a "federation" option for -storage.local.engine, which would lie in between "persisted" and "none" in terms of functionality. In this mode, indexes would be created as usual and checkpoints performed, but no sample chunks would ever be created. Running in this mode would result in a low-impact "federation" Prometheus instance that can collect metrics locally without consuming much RAM or disk IO, and can expose all the metrics on a single port via the federation API (which doesn't rely on chunk storage).
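
For illustration, a downstream Prometheus could scrape such an instance through the existing federation endpoint with a configuration along these lines (a minimal sketch; the job name and target address are placeholders, not part of this proposal):

```yaml
scrape_configs:
  - job_name: 'federate-local-gateway'   # hypothetical name for the downstream job
    metrics_path: '/federate'
    honor_labels: true                   # keep the job/instance labels set by the local instance
    params:
      'match[]':
        - '{job=~".+"}'                  # pull every series the local instance has collected
    static_configs:
      - targets:
          - 'gateway-host:9090'          # placeholder address of the local "pull gateway"
```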

One option to do this that exists already is to run with a very small retention period, but with a large number of metrics this makes series maintenance very busy, and it seems wasteful to spend CPU time encoding chunks when we're really only interested in the last sample value.

The firewall issue could instead be addressed via a proxy without involving a Prometheus at all, i.e. allow a single process to proxy metric fetching requests to all the underlying metric sources. The advantage of using Prometheus is that it simplifies some kinds of service discovery in that you don't need to have a global view of the entire system. In my case it's easier to learn what metric sources exist locally.

brian-brazil commented Nov 8, 2016

If you're not using chunks to store data, then where are you storing it?

A low retention period should do what you want, though it's best to push down your monitoring rather than introducing additional races and artifacts by using federation for everything. It sounds like what you want is a proxy server.

beorn7 commented Nov 8, 2016

> If you're not using chunks to store data, then where are you storing it?

I think he is referring to no series files. All chunks that are not yet purged are kept in the checkpoint. This makes sense for servers with low retention. If you know you can keep everything in RAM, there is no need to ever write to series files. It should be fairly easy to implement.

onorua commented Dec 3, 2016

I second this proposal. I was genuinely surprised by how federation in Prometheus works, or rather doesn't work.
Generally, a federation server should hide the complexity of the underlying servers' topology and provide a single point of contact for third-party systems (Grafana, in our case). So I naively started up a federation server with a match for all our metrics, and by midnight the server was dead due to throttling. From my analysis it seems the federation server is caching all the data from all federated servers, which doesn't solve the scalability problem at all.
Also, if you shard based on hash and have, say, 10 servers hidden behind one federation server, the federation server only has the basic metrics you use for graphs (because of point 1). If you need to dig deeply into a problem, you have to go to each and every server... and, most importantly, you cannot use functions such as avg, sum, or linear prediction across them, because each server is completely independent.
You can enable that particular metric to be scraped by the federation server as well, but it is not backfilled; the data only appears from the time you enable it...
Maybe I misread the documentation, but federation is practically useless for us in its current state, and it would be really great to have some kind of "pull gateway" as @ncabatoff proposed.
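
For reference, the kind of catch-all federation scrape described above looks roughly like this (a sketch only; the job name and shard addresses are placeholders):

```yaml
scrape_configs:
  - job_name: 'federate-everything'
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - '{__name__=~".+"}'             # pulls every series from every federated server
    static_configs:
      - targets: ['shard-1:9090', 'shard-2:9090']
```

With a selector like this the federating server re-ingests its children's entire load, which is the overload described above.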

brian-brazil commented Dec 4, 2016

> So I naively started up a federation server with a match for all our metrics, and by midnight the server was dead due to throttling

Federation is not intended to dump entire servers. It's intended to provide a limited number of aggregated stats to a higher-level Prometheus server.

> If you need to dig deeply into a problem, you have to go to each and every server

This is correct, and why we advise not using sharding unless you really have to. The vast majority of users do not have the thousands of machines in a single DC that'd justify horizontal sharding.

mattbostock commented Dec 4, 2016

brian-brazil commented Dec 4, 2016

In that sort of case I'd still advise keeping the data around in a Prometheus for a while, so you can debug with the raw data.

If you never want to use the high cardinality metrics, then you should remove the unwanted labels from instrumentation rather than adding a Prometheus to do so.

onorua commented Dec 4, 2016

> The vast majority of users do not have the thousands of machines in a single DC that'd justify horizontal sharding.

We've got more than 3.2 million metrics from fewer than 400 servers, which was enough to use 32GB of RAM and 12 cores, with throttling every 2 hours. I had to use federation just to scale past 500 nodes. We are currently testing Prometheus on our test cluster, and a single node is good enough at small scale (<300 nodes), but not at all if you need horizontal scale.

mattbostock commented Dec 4, 2016

> In that sort of case I'd still advise keeping the data around in a Prometheus for a while, so you can debug with the raw data.

Makes sense.

ncabatoff added a commit to ncabatoff/prometheus that referenced this issue Dec 4, 2016

First pass at Issue prometheus#2174: Add support for local storage engine that only supports federation queries.

Adds a new option for flag -storage.local.engine: "federated".  When federated
mode is selected, turns on new local storage option DoNotPersistChunks.

This changeset bumps the checkpoint format to 3 in order to add a new
per-series field lastSampleValue, since checkpoints don't include any chunks in
federated mode.  Checkpoint formats 1 and 2 are still supported for reads.
This is the only change that applies both to the "federated" and "persisted"
settings for -storage.local.engine.

All of the following behaviour applies only if DoNotPersistChunks=true:

* Non-federation queries don't work.

* No chunk eviction or throttling (rushed mode).

* No chunks are ever encoded or persisted.  When adding a sample to a series
  we do nothing except update lastTime, savedFirstTime, lastSampleValue,
  lastSampleValueSet.

* During series maintenance, skip all the chunk-related work; just drop any
  series whose lastTime is too old.  Never archive, unarchive, or quarantine
  series if DoNotPersistChunks=true.

* Crash recovery ignores series files.

fabxc commented Dec 5, 2016

> We've got more than 3.2 million metrics from fewer than 400 servers, which was enough to use 32GB of RAM and 12 cores, with throttling every 2 hours. I had to use federation just to scale past 500 nodes. We are currently testing Prometheus on our test cluster, and a single node is good enough at small scale (<300 nodes), but not at all if you need horizontal scale.

Prometheus is first and foremost not a horizontally scaling system and trying to treat it as one will inevitably leave you disappointed.

Are these 3.2 million metrics from all sorts of different applications running on these servers too?
The Prometheus way to do it here would be to apply some basic form of functional sharding. If you are monitoring your entire stack, there's generally little practical need to co-locate and correlate all possible metrics.

It's totally sane to have e.g. three Prometheus servers. One for node-exporters, one for frontend services, one for databases.
That's just one example of course. You should co-locate metrics where it makes sense. But at some point you just cannot handle "everything" in a single place anymore, and Prometheus favours practicality and resilience over auto-magic distributed properties that are themselves a huge failure potential.

There's a chance we'll figure something out in the future once remote storage comes into play. But it likely won't remove the need to functionally shard your instances.

onorua commented Dec 5, 2016

I'm absolutely fine with the sharding idea. I did shard our nodes by node_exporter/cAdvisor/our_own_exporter, but beyond the theoretical simplicity there is operational and maintenance overhead. Let me elaborate:

Right now we have 700 nodes managed by one Prometheus server (the cAdvisor fraction), which is roughly the limit for our current 32GB RAM server.

We want to replace all our 8GB RAM compute nodes with ones 4 times smaller, which means we will have around 3200 nodes at peak and around 2000 off-peak. So the only way to handle this number of metrics and nodes is sharding. But I can't shard by purpose or metric type; most probably I'll shard based on hash, because otherwise you treat each Prometheus as a "pet" with a different config that you can't automate. So you end up with something like 6 Prometheus servers to handle this traffic for node_exporter alone, and you double that to make your data HA.
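
For concreteness, hash-based sharding here presumably means the usual hashmod relabelling pattern, where each of N identically configured servers keeps only the targets whose address hashes to its shard number (a sketch under that assumption; the job name, Consul details, and modulus are placeholders):

```yaml
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
    relabel_configs:
      # hash each target address and keep only the targets that land on this shard
      - source_labels: [__address__]
        modulus: 6                       # total number of shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: '0'                       # this instance is shard 0; its siblings use 1..5
        action: keep
```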

Then you do maintenance, a release upgrade, a kernel upgrade, whatever, and you see a slight increase in CPU usage. The first thing you do is go to your Prometheus server and dig as deep as you can. But because you sharded by hash, you have no clue where to go, so you go to each and every node and make a query. Then you discover that one of the nodes was restarting and some data wasn't scraped during that time, so you have to go to its HA twin to get it. Basically, to do any kind of analysis in this case you may need to do 12 times more work than if it were one server.

There is no magic in what I describe; the kind of federation we have now puts additional overhead on the operations team, which is basically what forced the DO guys to reimplement Prometheus.

brian-brazil commented Dec 5, 2016

> So you go to each and every node and make a query.

The trick is to federate up into the master Prometheus one non-aggregated metric per host that includes a label indicating the slave, and to use that to find the slave Prometheus you want.
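
A sketch of that pattern, assuming each slave sets an external label such as slave: a in its own configuration (the metric, job names, label name, and addresses here are illustrative):

```yaml
# On the master: federate one cheap, non-aggregated series per host from every
# slave; honor_labels keeps the slave's external label intact.
scrape_configs:
  - job_name: 'federate-slaves'
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - 'up{job="node"}'               # one series per scraped host, carrying the slave label
    static_configs:
      - targets: ['slave-a:9090', 'slave-b:9090']
```

Querying that series on the master for a given instance then tells you which slave holds the full data for that host.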

onorua commented Dec 6, 2016

@brian-brazil thanks for the suggestion! This should work in a static environment, but I have trouble applying it to a dynamic environment.
Imagine you have 4 nodes, hash-based sharding, and the nodes are registered in Consul. There is a process which reads the number of Prometheus nodes and adjusts the math, so if you have 4 nodes, each scrapes every 4th target, but if you have 5 nodes, each scrapes every 5th.
Nothing fancy, just boring service discovery.
Then you add one node to the shards (2 nodes in practice, counting the HA twin) and part of the data about a given node ends up on one Prometheus server and part on another. Is it convenient for an operator to go to 2 nodes and use some magic to merge the output from the two?
You might say I could use a static configuration: one server for the first 100-300 nodes, a second for 300-600, and so forth. But in a dynamic environment, where a node lives for 12 hours at most, I do not see how to get that working in a sane way.

brian-brazil commented Dec 6, 2016

> There is a process which reads the number of Prometheus nodes and adjusts the math, so if you have 4 nodes, each scrapes every 4th target, but if you have 5 nodes, each scrapes every 5th.

Changing the number of slaves should be quite rare, every few years maybe. That doesn't affect the trick I proposed though.

> Is it convenient for an operator to go to 2 nodes and use some magic to merge the output from the two?

That's what the master node is for, to do aggregation.

> But in a dynamic environment, where a node lives for 12 hours at most, I do not see how to get that working in a sane way.

Nothing changes, it'll just work.

onorua commented Dec 6, 2016

@brian-brazil

> Changing the number of slaves should be quite rare, every few years maybe.

With all due respect, why do you think that changing the number of nodes is, or should be, rare? I've tripled the number of nodes within the past 2 weeks, and this is just the beginning; we are not even close to the required capacity. We will multiply the number of nodes by 4, because we are moving to micro-nodes and micro-services, which means every couple of weeks I'll add a new Prometheus node.

> That's what the master node is for, to do aggregation.

Completely agree with you! That is the whole point of this ticket! Let us use the master node for federation (hiding the internal topology) and for deep node analysis simultaneously, or at least transparently, and everybody will be happy.

brian-brazil commented Dec 6, 2016

> With all due respect, why do you think that changing the number of nodes is, or should be, rare?

You should be capacity planning so that it's rare. On the (large) systems I've worked on previously we doubled every 2-3 years. Adding nodes to a setup like this is disruptive, so you want to keep it rare.

> Let us use the master node for federation (hiding the internal topology) and for deep node analysis simultaneously, or at least transparently, and everybody will be happy.

That's not what this issue is requesting. Attempting to use what this issue requests to help scaling issues won't help, as you're still pulling all data into the master.

brian-brazil commented Jul 14, 2017

A short retention period with the new storage covers this.
