Add MOTIVATION.md -- an explanation of "why promxy"
jacksontj committed Apr 24, 2018
1 parent 84326df commit 00ba3af
Showing 2 changed files with 68 additions and 0 deletions.
55 changes: 55 additions & 0 deletions MOTIVATION.md
@@ -0,0 +1,55 @@
## Promxy, a prometheus scaling story

### The beginning
Prometheus is an all-in-one metrics and alerting system. The fact that everything is built in is quite convenient when doing initial setup and testing. Throw grafana in front of that and we are cooking with gas! At this scale there were no concerns, only snappy metrics and pretty graphs.

### Redundancy
After installing the first prometheus host you realize that you need redundancy. To do this you stand up a second prometheus host with the same scrape config. At this point you have 2 nodes with the data, but grafana pointing at only one. Quickly you put a load balancer in front, and grafana load balances between the 2 nodes -- problem solved! Then, some time in the future, you have to reboot a prometheus host. After the host reboots you notice that you have holes in the graphs 50% of the time. With prometheus itself there is no short- or long-term solution for this, as there is no cross-host merging and prometheus' datastore doesn't support backfilling data.

### Sharding
As you continue to scale (adding more machines and metrics) you quickly realize that
all of your metrics cannot fit on a single host anymore. No problem, we'll shard the
scrape config! The suggested way in the prometheus community is to split your metrics
based on application -- so you dutifully do so. Now you have a metrics cluster per app.
Soon after, though, you realize that there are lots of servers, so it's not even
feasible to fit all of an app's metrics in a single shard -- you need to split them
further. You do this by region/az/etc., but now grafana is littered with so many
prometheus datasources! No problem, you say to yourself, as you switch from using
a single source to using mixed sources -- adding each region/az/etc. with the
same promql statements.

### Aggregation
As you've settled into running prometheus in an N shard by M replica setup (with N
selectors in grafana) and having to duplicate promql statements in grafana, you
realize that you have some metrics/alerts that need to be global (such as global QPS,
latency, etc.). And here you realize that this cannot be done with the infrastructure
you have set up! This is no problem though: as you read through the prometheus
community forums etc., you learn that you are supposed to set up a global aggregation
shard of prometheus to scrape all your other prometheus instances. You set this up,
but then realize that the recommendation is to summarize the data (which makes sense,
since you cannot just shove N hosts' worth of data onto 1). Since these are global
metrics, you rationalize that it is okay to drop some granularity for the sake of
having them. Then you look over the rest of the cluster, only to realize that this
federation accounts for 90%+ of all queries to prometheus -- meaning you now have to
scale up your other existing shards just to get your global metrics.

### Despair
At this point you consider your situation:

- You have metrics
- You have redundancy (with the occasional hole on a restart or node failure)
- You have **many** prometheus data sources in grafana (which is confusing to all your grafana users -- as well as yourself!)
- You have Aggregation set up -- which (1) accounts for the majority of load on the other prometheus hosts, (2) is at a lower granularity than you'd like, and (3) means that you have to maintain alerting rules for the **aggregation** layer separately from the rest of the prometheus hosts.

And you tell yourself, this seems too complicated; there must be a better way!

### Promxy
You google around (or ask a friend) and you find out about this tool -- promxy
(what a silly name). You set it up and are immediately able to solve your pain points:

- no more "holes" in metrics
- single source in grafana
- no need for aggregation layers of prometheus anymore!

In addition to solving the pain points, you get access logs (which you didn't even
put on the list, but admit it: you've been missing them).
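The end state described above can be sketched as a single promxy config. The snippet below is illustrative only -- the hostnames, port, and `anti_affinity` value are placeholder assumptions; consult the repository's example config for the authoritative option names:

```yaml
# Hypothetical promxy config sketch: prom-a/prom-b are placeholder
# hostnames for two identically configured prometheus replicas.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - prom-a:9090
            - prom-b:9090
      # Treat the two hosts as replicas of the same data, merging
      # their results so a gap in one is filled from the other.
      anti_affinity: 10s
```

With every replica set expressed as one server group, grafana needs only the single promxy datasource, and promql queries span all shards.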
13 changes: 13 additions & 0 deletions README.md
@@ -7,6 +7,19 @@ and use of prometheus at scale (when you have more than one prometheus host).
Promxy delivers this unified access endpoint without requiring **any** sidecars,
custom-builds, or other changes to your prometheus infrastructure.

## Why promxy?
[**Detailed version**](MOTIVATION.md)

**Short version**:
Prometheus itself provides no real HA/clustering support. As such, the best practice
is to run multiple hosts (e.g. N) with the same config. Similarly, prometheus has no real
built-in query federation, which means that you end up with N sources in grafana,
which (1) confuses grafana users and (2) offers no support for aggregation across the sources.
Promxy enables an HA prometheus setup by "merging" the data from the duplicate
hosts (so if there is a gap in one, promxy will fill with the other). In addition
Promxy provides a single datasource for all promql queries -- meaning your grafana
can have a single source and you can have globally aggregated promql queries.
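The "merging" of duplicate hosts can be illustrated with a toy sketch. This is not promxy's actual implementation, just the idea: take the union of samples from two replicas, keyed by timestamp, preferring one replica wherever both have data.

```python
def merge_series(primary, secondary):
    """Merge two replicas' samples ({timestamp: value} dicts):
    timestamps present in either replica survive; where both have a
    sample, the primary's value wins. Returns sorted (ts, value) pairs."""
    merged = dict(secondary)  # start from the backup replica's samples
    merged.update(primary)    # primary overwrites where both overlap
    return sorted(merged.items())

# replica A rebooted and missed timestamps 30 and 40
replica_a = {10: 1.0, 20: 1.1, 50: 1.4}
replica_b = {10: 1.0, 20: 1.1, 30: 1.2, 40: 1.3, 50: 1.4}

print(merge_series(replica_a, replica_b))
# → [(10, 1.0), (20, 1.1), (30, 1.2), (40, 1.3), (50, 1.4)]
```

The holes in replica A's data are filled from replica B, which is why a reboot no longer produces gaps in the graphs.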

## Quickstart
Release binaries are available on the [releases](https://github.com/jacksontj/promxy/releases) page.

