explaining an HA + scalable setup? #1500

Closed
KlavsKlavsen opened this Issue Mar 24, 2016 · 18 comments

@KlavsKlavsen

KlavsKlavsen commented Mar 24, 2016

Hi guys,

I've been looking at replacing Graphite for some time, though we're pretty bound to the push model here.

I'm trying to figure out how to do a high-availability setup with Prometheus. I was planning on setting it up on two physical servers, and I'd like my Grafana dashboards and collection to keep working when one physical server goes down. Load sharing would be nice, but I'd be fine with DRBD between the two if need be.

Can you point me to anything explaining how this works with Prometheus? I've read about federation, but I can't see that giving me HA - only manual scaling, meaning that the Grafana graphs would need to be manually edited to query the correct Prometheus server (the one that actually has the data), as I understand it?

Is there any routing/discovery, so Grafana could query just one node and the request would be routed to the server that has the data?

@matthiasr

Contributor

matthiasr commented Mar 24, 2016

For HA, simply run the two Prometheus servers independently with the same configuration. They'll have the same data, modulo sample timings.

Set up a way to query one of the Prometheus servers that you can switch over to the other. We're mostly doing this with simple CNAMEs in DNS; when a server dies, we just switch the CNAME to the other one. If you want it to be more automatic, use VIPs or load balancers. Configure this name in Grafana, i.e. pretend to Grafana that there is only one server.

The only downside is that while one server is down, it won't collect, so it will have holes in its data for that time. This would be the same with federation.
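
As a minimal sketch of that setup (hostnames, targets, and the prometheus.example.com name below are hypothetical, not from this thread), both servers run an identical configuration and Grafana only ever sees one name:

```yaml
# prometheus.yml -- identical on both HA servers (e.g. prometheus-a and prometheus-b)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['web01:9100', 'web02:9100']  # same target list on both servers
```

Grafana's datasource would then point at something like http://prometheus.example.com:9090, where prometheus.example.com is the CNAME (or VIP/load balancer) described above; failing over is a DNS change, with no dashboard edits.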

@matthiasr

Contributor

matthiasr commented Mar 24, 2016

For scaling on the other hand, there are some things you can do with relabelling and hashmod, but we normally just split a Prometheus that gets too loaded by giving services their own Prometheus. This obviously only works well if you have many services with reasonable metric/instance counts each, so you have some line to split along.
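
As a rough illustration of the relabelling + hashmod approach (the job name, target list, and shard count below are assumptions for the example), each of the N servers keeps only the targets whose address hashes into its own bucket:

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['web01:9100', 'web02:9100', 'web03:9100']  # or any service discovery
    relabel_configs:
      # Hash each target's address into one of 2 buckets.
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # This server keeps only bucket 0; the second server would keep bucket 1.
      - source_labels: [__tmp_hash]
        regex: '0'
        action: keep
```

Each shard then scrapes roughly half the targets, at the cost of having to know which shard holds a given target when querying.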

@KlavsKlavsen

KlavsKlavsen commented Mar 30, 2016

Thank you for the feedback. The issue with "just running two" servers is that you'll have to collect twice (i.e. double the load on the servers you collect from), but as long as the agents (hopefully) just collect and keep the values in memory, it won't be much of a load (since they won't fetch the details twice from the hosts we're collecting from). I'd probably have to go with graphite_collector, since opening the firewall from the Prometheus server to ALL servers will be an uphill fight (and, to be fair, it is a security risk if anyone finds a bug in the listening daemon - although they'd have to own the Prometheus server first.. but still).

It would be a very cool feature if federation worked like a "shared cloud" (like LDAP or MS AD, for example), i.e. each server knew which stat keys the others had (in Graphite these would be the paths, e.g. servers.server01.cpu...), so you could query any server and get forwarded (HTTP 301 or just a proxied request) to a server that has the data.

@brian-brazil

Member

brian-brazil commented Mar 30, 2016

> so you could query any server and get forwarded (HTTP 301 or just a proxied request) to a server that has the data

This is not a feature we plan to add. Prometheus is intended for your critical monitoring, and having a full-mesh of communication between servers opens risks around cascading failures and all the complexities of having to synchronise metadata across servers.

In practice you'll almost always know which Prometheus server has the data you're looking for, so it's not a major problem in the real world.

@shortdudey123

shortdudey123 commented Apr 11, 2016

> you'll almost always know

How do you deal with the cases in which you don't? What is best practice for those cases?

@juliusv

Member

juliusv commented Apr 11, 2016

@shortdudey123 I think the "best practice" currently is to set up your topology in such a way that you know where to find the metrics you are interested in. In some situations a label can give you a clue about the source: for example, when you're doing hierarchical federation from per-zone Prometheus servers to a global one, the data coming from the per-zone Prometheuses gets a zone label attached, which tells you where to go to find more detail.

It would be interesting to think about better auto-discovery (UI?) features in the future, as long as those don't lead to stronger coupling of components, reliability-wise.
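
A hedged sketch of that hierarchical setup (zone names, ports, and the match[] selector are illustrative, not taken from this thread): each per-zone Prometheus carries a zone external label, and the global server scrapes their /federate endpoints while preserving those labels:

```yaml
# prometheus.yml on a per-zone server (here the zone is assumed to be "eu-west")
global:
  external_labels:
    zone: 'eu-west'
```

```yaml
# prometheus.yml on the global server: federate from the per-zone servers
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'   # which series to pull up; you'd usually narrow this down
    static_configs:
      - targets:
          - 'prometheus-eu-west:9090'
          - 'prometheus-us-east:9090'
```

Series arriving at the global server then carry zone="eu-west" (or "us-east"), which is the clue @juliusv describes for knowing where to drill down.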

@shortdudey123

shortdudey123 commented Apr 11, 2016

> you know where to find the metrics you are interested in

This is a perfect-world scenario that does not always happen, even with the best planning.

Take the example that was mentioned above: a physical server is downed. When the host is brought back up, it will have holes in its dataset, and in that case you would need something to tell you to go to the other host in the pair to get that data.

@juliusv

Member

juliusv commented Apr 11, 2016

This is correct. There may be some difference in expectations here. If you need 100% accuracy in data reporting, Prometheus is not for you (see also http://prometheus.io/docs/introduction/overview/#when-does-it-not-fit). We generally tolerate occasional gaps in collection (caused by network problems, downed Prometheus servers, etc.), as long as there is always one available Prometheus server for the purpose of current fault detection. Prometheus's local storage is generally not considered durable, replicated, or arbitrarily long-term storage - even if it works well for weeks, months, or sometimes even years in practice.

"Real" long-term storage is still in the works and aims to address individual down servers and replication of data better. It will be decoupled enough from the main Prometheus servers that these won't have to hard-depend on it though.

@shortdudey123

shortdudey123 commented Apr 11, 2016

> We generally tolerate occasional gaps in collection

Gotcha, thanks for the quick replies :)

@juliusv

Member

juliusv commented Apr 11, 2016

@shortdudey123 Note that I wouldn't worry about gaps too much in practice. They should still be pretty rare, so I would still use Prometheus for trending over reasonable periods of time, just not if the trend data is really critical to my business.

@shortdudey123

shortdudey123 commented Apr 11, 2016

> They should still be pretty rare

Unless you run the server in the cloud :p

@juliusv

Member

juliusv commented Apr 11, 2016

We generally recommend running Prometheus servers in the same failure domain as the jobs they monitor; otherwise you won't get a very reliable monitoring setup, no matter what solution you are using. For example, run one Prometheus server (or HA pair) in each zone/cluster, monitoring jobs only in that zone. Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones, and have meta-monitoring on top of that, if your topology is even that big.

Or maybe you mean you run everything in the cloud - the cloud itself shouldn't be that unreliable though? :)

Or maybe you mean you're dynamically re-scheduling even your Prometheus servers all the time via some cluster manager, in which case, yes, you're kind of screwed unless you have good network file systems or block devices that get shared across instance migrations. :)
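
For the meta-monitoring layer, a sketch under assumptions (the job name, targets, alert name, and use of the current rule-file format are all illustrative): the global Prometheus scrapes the per-zone Prometheus servers themselves and alerts when one stops answering:

```yaml
# On the global Prometheus: scrape the per-zone servers' own /metrics endpoints
scrape_configs:
  - job_name: 'prometheus-meta'
    static_configs:
      - targets:
          - 'prometheus-eu-west:9090'
          - 'prometheus-us-east:9090'
```

```yaml
# Rule file (current rule-file syntax): fire when a monitored Prometheus is unreachable
groups:
  - name: meta-monitoring
    rules:
      - alert: PrometheusDown
        expr: up{job="prometheus-meta"} == 0
        for: 5m
        labels:
          severity: critical
```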

@fabxc fabxc added kind/question and removed question labels Apr 28, 2016

@willbryant

willbryant commented May 11, 2016

If you run two nodes to provide HA, does that mean that you'll get twice the alerts?

@brian-brazil

Member

brian-brazil commented May 11, 2016

No, the Alertmanager will deduplicate them.
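
For illustration, a sketch assuming a current Prometheus version (in the versions around at the time of this thread the Alertmanager address was given as a command-line flag instead) - both servers of the HA pair evaluate the same rules and point at the same Alertmanager, and because the resulting alerts carry identical labels, Alertmanager treats them as one:

```yaml
# prometheus.yml on both HA servers (the Alertmanager hostname is hypothetical)
rule_files:
  - 'alert.rules.yml'   # identical alerting rules on both servers

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
```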

@willbryant

willbryant commented May 12, 2016

But at the moment Alertmanager has no native clustering, so if you run Prometheus on two nodes, won't you have two Alertmanagers?

@brian-brazil

Member

brian-brazil commented May 12, 2016

You'd run a single Alertmanager in that scenario. While you may have many Prometheus servers, you'd only have one Alertmanager, so manual failover is sufficient for HA.

@willbryant

willbryant commented May 12, 2016

Hmm, OK. So the dashboards might be HA, but alerts won't be - if the server hosting the Alertmanager goes down, then we won't get an alert to tell us about it :).

@brian-brazil

Member

brian-brazil commented May 12, 2016

It's always important to have independent monitoring of your monitoring system. It's still HA, even with manual failover.
