Add MOTIVATION.md -- an explanation of "why promxy"
jacksontj committed Apr 24, 2018
1 parent 84326df commit 00ba3af
Showing 2 changed files with 68 additions and 0 deletions.
55 changes: 55 additions & 0 deletions MOTIVATION.md
@@ -0,0 +1,55 @@
## Promxy, a prometheus scaling story

### The beginning
Prometheus is an all-in-one metrics and alerting system. The fact that everything is built in is quite convenient when doing initial setup and testing. Throw grafana in front of that and we are cooking with gas! At this scale there were no concerns, only snappy metrics and pretty graphs.

### Redundancy
After installing the first prometheus host you realize that you need redundancy. To do this you stand up a second prometheus host with the same scrape config. At this point you have 2 nodes with the data, but grafana pointing at only one. Quickly you put a load balancer in front, and grafana load balances between the 2 nodes -- problem solved! Then, some time in the future, you have to reboot a prometheus host. After the host reboots you notice that you have holes in the graphs 50% of the time. With prometheus itself there is no short- or long-term solution for this, as there is no cross-host merging and prometheus' datastore doesn't support backfilling data.

### Sharding
As you continue to scale (adding more machines and metrics) you quickly realize that
all of your metrics cannot fit on a single host anymore. No problem, we'll shard the
scrape config! The suggested way in the prometheus community is to split your metrics
based on application -- so you dutifully do so. Now you have a metrics cluster per app.
Soon after, though, you realize that there are lots of servers, so it's not even
feasible to fit all of an app's metrics in a single shard -- you need to split them
further. You do this by region/az/etc., but now grafana is littered with so many
prometheus datasources! No problem, you say to yourself, as you switch from using
a single source to using mixed sources -- adding each region/az/etc. with the
same promql statements.

### Aggregation
As you've settled into running prometheus in an N shard by M replica setup (with N
selectors in grafana) and having to duplicate promql statements in grafana, you
realize that you have some metrics/alerts that need to be global (such as global QPS,
latency, etc.). And here you realize that this cannot be done with the infrastructure
you have set up! This is no problem though: as you read through the prometheus
community forums etc., you learn that you are supposed to set up a global aggregation
shard of prometheus to scrape all your other prometheus instances. You set this up,
but then realize that the recommendation is to summarize the data (which makes sense,
since you cannot just shove N hosts' worth of data onto 1). Since these are global
metrics, you rationalize that it is okay to drop some granularity for the sake of
having them. Then you look over the rest of the cluster, only to realize that this
federation accounts for 90%+ of all queries to prometheus -- meaning you now have to
scale up your other existing shards just to get your global metrics.

### Despair
At this point you consider your situation:

- You have metrics
- You have redundancy (with the occasional hole on a restart or node failure)
- You have **many** prometheus data sources in grafana (which is confusing to all your grafana users -- as well as yourself!)
- You have Aggregation set up -- which (1) accounts for the majority of load on the other prometheus hosts, (2) is at a lower granularity than you'd like, and (3) means that you have to maintain alerting rules for the **aggregation** layer separately from the rest of the prometheus hosts.

And you tell yourself, this seems too complicated; there must be a better way!

### Promxy
You google around (or ask a friend) and you find out about this tool -- promxy
(what a silly name). You set it up and are immediately able to solve your pain points:

- no more "holes" in metrics
- single source in grafana
- no need for aggregation layers of prometheus anymore!

In addition to solving the pain points, you get access logs (which you didn't even
put on the list, but admit it: you've been missing them).
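The end state described above can be sketched as a single promxy config. The snippet below is illustrative only -- the hostnames, port, and `anti_affinity` value are placeholder assumptions; consult the repository's example config for the authoritative option names:

```yaml
# Hypothetical promxy config sketch: prom-a/prom-b are placeholder
# hostnames for two identically configured prometheus replicas.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - prom-a:9090
            - prom-b:9090
      # Treat the two hosts as replicas of the same data, merging
      # their results so a gap in one is filled from the other.
      anti_affinity: 10s
```

With every replica set expressed as one server group, grafana needs only the single promxy datasource, and promql queries span all shards.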
13 changes: 13 additions & 0 deletions README.md
@@ -7,6 +7,19 @@ and use of prometheus at scale (when you have more than one prometheus host).
Promxy delivers this unified access endpoint without requiring **any** sidecars,
custom-builds, or other changes to your prometheus infrastructure.

## Why promxy?
[**Detailed version**](MOTIVATION.md)

**Short version**:
Prometheus itself provides no real HA/clustering support. As such, the best practice
is to run multiple hosts (e.g. N) with the same config. Similarly, prometheus has no real
built-in query federation, which means that you end up with N sources in grafana,
which (1) confuses grafana users and (2) offers no support for aggregation across the sources.
Promxy enables an HA prometheus setup by "merging" the data from the duplicate
hosts (so if there is a gap in one, promxy will fill with the other). In addition
Promxy provides a single datasource for all promql queries -- meaning your grafana
can have a single source and you can have globally aggregated promql queries.
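The "merging" of duplicate hosts can be illustrated with a toy sketch. This is not promxy's actual implementation, just the idea: take the union of samples from two replicas, keyed by timestamp, preferring one replica wherever both have data.

```python
def merge_series(primary, secondary):
    """Merge two replicas' samples ({timestamp: value} dicts):
    timestamps present in either replica survive; where both have a
    sample, the primary's value wins. Returns sorted (ts, value) pairs."""
    merged = dict(secondary)  # start from the backup replica's samples
    merged.update(primary)    # primary overwrites where both overlap
    return sorted(merged.items())

# replica A rebooted and missed timestamps 30 and 40
replica_a = {10: 1.0, 20: 1.1, 50: 1.4}
replica_b = {10: 1.0, 20: 1.1, 30: 1.2, 40: 1.3, 50: 1.4}

print(merge_series(replica_a, replica_b))
# → [(10, 1.0), (20, 1.1), (30, 1.2), (40, 1.3), (50, 1.4)]
```

The holes in replica A's data are filled from replica B, which is why a reboot no longer produces gaps in the graphs.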

## Quickstart
Release binaries are available on the [releases](https://github.com/jacksontj/promxy/releases) page.

