Commit 7f79b18

better document priority and ready state

Dieterbe committed Feb 12, 2019
1 parent 544f48f commit 7f79b18
Showing 3 changed files with 39 additions and 6 deletions.
34 changes: 33 additions & 1 deletion docs/clustering.md
@@ -83,6 +83,39 @@ partitions | 0,1 | 0,2 | 1,3 | 2,3 |
This would offer better load balancing should node A fail (B and C will each take over a portion of the load), but will require making primary status a per-partition concept.
Hence, this is currently **not supported**.

## Priority and ready state

Priority is a measure of how in-sync a metrictank instance is with its input data: the lower the value, the more up to date the instance.

| input plugin | priority |
| ------------- | --------------------------------------------------------------- |
| carbon-in | 0 |
| prometheus-in | 0 |
| kafka-mdm-in | estimate of consumer lag in seconds. 10k if unknown (2.8 hours) |
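
To make the `kafka-mdm-in` row concrete, here is a minimal sketch of how a lag-in-seconds estimate can be derived from offsets. All names here are hypothetical illustrations, not metrictank's actual code, which lives in its kafka-mdm input plugin.

```go
package main

import "fmt"

// unknownPriority is the documented fallback when lag cannot be
// estimated: 10000 seconds, roughly 2.8 hours.
const unknownPriority = 10000

// lagSeconds estimates how far behind a consumer is on one partition:
//   newest:  latest offset available on the broker
//   current: offset the consumer has processed up to
//   rate:    recently observed ingest rate, in offsets per second
func lagSeconds(newest, current int64, rate float64) int {
	if rate <= 0 || newest < current {
		return unknownPriority
	}
	return int(float64(newest-current) / rate)
}

func main() {
	// 50000 offsets behind at 10000 offsets/s -> 5 seconds of lag
	fmt.Println(lagSeconds(1050000, 1000000, 10000))
	// no rate observed yet -> fall back to 10000
	fmt.Println(lagSeconds(1050000, 1000000, 0))
}
```

A node consuming multiple partitions would presumably report the worst (highest) such lag as its priority, so one slow partition marks the whole instance as behind.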

* the priority value is gossiped to all peers in a sharding cluster
* to satisfy user queries, requests are fanned out across all ready instances in the cluster (see below), and lower-priority (more in-sync) instances are favored, as sketched below
* priority can be inspected via http endpoints such as `/cluster`, `/priority` and `/node`, or via the dashboard/metrics
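
As an illustration of that fan-out rule, here is a sketch of picking the instance to query for one shard (hypothetical types and names, not metrictank's actual code): only ready peers are candidates, and among those, the lowest priority wins.

```go
package main

import "fmt"

// peer is a hypothetical view of what gossip tells us about an instance.
type peer struct {
	name     string
	ready    bool
	priority int // lower = more in-sync
}

// pickPeer returns the most in-sync ready peer for a shard,
// or false if no ready peer serves it.
func pickPeer(peers []peer) (peer, bool) {
	var best peer
	found := false
	for _, p := range peers {
		if !p.ready {
			continue
		}
		if !found || p.priority < best.priority {
			best, found = p, true
		}
	}
	return best, found
}

func main() {
	candidates := []peer{
		{"mt-a", true, 3},  // ready, 3s behind
		{"mt-b", false, 0}, // not ready: never selected
		{"mt-c", true, 8},  // ready, but more lag than mt-a
	}
	if p, ok := pickPeer(candidates); ok {
		fmt.Println("querying", p.name) // querying mt-a
	}
}
```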

Readiness, or "ready state":

(whenever we say "ready" or "ready state", we mean the value that takes priority into account, not the internal NodeState; see below)

* indicates whether an instance is considered ready to satisfy data requests
* an instance that is not ready refuses data and index requests
* can be checked via the `/` http endpoint (see the probe sketch below): https://github.com/grafana/metrictank/blob/master/docs/http-api.md#get-app-status
* the GC behavior of a not-ready node can be tuned via the `cluster.gc-percent-not-ready` setting
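
For example, a health check or load balancer can key off that `/` endpoint. A minimal probe sketch follows; the endpoint and status codes are as documented above, while the port is an assumption to adjust for your deployment.

```go
package main

import (
	"fmt"
	"net/http"
)

// isReady reports whether a metrictank instance answers its `/`
// app-status endpoint with 200 (ready) rather than 503 (not ready).
func isReady(baseURL string) (bool, error) {
	resp, err := http.Get(baseURL + "/")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ready, err := isReady("http://localhost:6060") // assumed port
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("ready:", ready)
}
```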

A node is ready when all of the following are true (see the sketch after this list):
* its priority does not exceed the `cluster.max-priority` setting, which defaults to 10.
* its internal NodeState is ready, which happens:
  * for primary nodes, immediately after startup completes (loading the index, starting input plugins, etc)
  * for secondary nodes, once the warm-up period (the `warm-up-period` setting) has elapsed after startup.
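
Putting both conditions together, the effective ready state boils down to the following sketch (not metrictank's actual code):

```go
package main

import "fmt"

// ready is the value that peers and the `/` endpoint act on:
// the internal NodeState must be ready AND the priority must not
// exceed the cluster.max-priority setting (default 10).
func ready(nodeStateReady bool, priority, maxPriority int) bool {
	return nodeStateReady && priority <= maxPriority
}

func main() {
	fmt.Println(ready(true, 3, 10))     // warmed up, 3s behind: true
	fmt.Println(ready(true, 10000, 10)) // lag unknown (10k): false
	fmt.Println(ready(false, 0, 10))    // still warming up: false
}
```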

Special cases:
* the `/node` and `/cluster` endpoints show the internal state of the node, including the internal NodeState.
* what is gossiped across the cluster is also the full internal node state (including NodeState, priority, etc)
* the `cluster.self.state.ready.gauge1` metric also reflects the internal NodeState, whereas the `cluster.total.state` metrics use the normal ready state.

## Caveats

@@ -96,5 +129,4 @@ Enable the `create-keyspace` on only one node, or leave it enabled for all, but

## Other
- use min-available-shards to control how many unavailable shards are tolerable (see the config sketch below)
- use max-priority to control how much priority / data lag is tolerable (note: lowest-lag shards are preferred)
- note: currently, if a shard fails, metrictank doesn't retry other instances in the same request
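
For instance, a config fragment touching these settings might look as follows. This is an ini-style sketch; the values are illustrative, not recommendations, and the exact keys and defaults should be checked against your sample config.

```ini
[cluster]
# minimum number of shards that must be available for a query
# to be served (illustrative value)
min-available-shards = 1
# instances whose priority (lag estimate, in seconds) exceeds this
# are considered not ready; 10 is the documented default
max-priority = 10
```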
6 changes: 3 additions & 3 deletions docs/http-api.md
@@ -15,8 +15,8 @@ POST /

returns:

- * `200 OK` if the node is primary or a warmed up secondary (`warmupPeriod` has elapsed)
- * `503 Service not ready` if the node is secondary and not yet warmed up.
+ * `200 OK` if the node is [ready](clustering.md#priority-and-ready-state)
+ * `503 Service not ready` otherwise.

#### Example

@@ -132,7 +132,7 @@ returns a json document with the following fields:
* "primary": whether the node is a primary node or not
* "primaryChange": timestamp of when the primary state last changed
* "version": metrictank version
* "state": whether the node is ready to handle requests or not
* "state": whether the node is [ready](clustering.md#priority-and-ready-state) to handle requests or not
* "stateChange": timestamp of when the state last changed
* "started": timestamp of when the node started up

5 changes: 3 additions & 2 deletions docs/startup.md
@@ -22,11 +22,12 @@ The full startup procedure has many details, but here we cover the main steps if
| init Index | creates session, keyspace, tables, write queues, etc and loads in-memory index from persisted data | reasonable RAM and CPU increase |
| create cluster notifier | optional: connects to Kafka, starts backfilling persistence message and waits until done or timeout| if backfilling: above-normal CPU, normal RAM usage |
| start input plugin(s) | starts backfill (kafka) or listening (carbon, prometheus) and maintain priority based on input lag | if backfilling: above-normal CPU and RAM usage |
- | mark ready state | immediately (primary) or after warmup period (secondary) (combined with priority for clustering) | no |
+ | mark ready state | immediately (primary) / after warmup (secondary) [details](clustering.md#priority-and-ready-state) | no |

We recommend provisioning a cluster such that it can backfill a 7-hour backlog in half an hour or less. This means:
* The CPU increase during the kafka backfilling is very significant: consuming 7 hours of data in half an hour requires about 14x the normal ingest rate, so typically a 14x CPU increase compared to normal usage.
- * The RAM usage during the input data backfilling is typically about 1.5x to 2x normal.
+ * The RAM usage during the input data backfilling is typically about 1.5x to 2x normal,
+ though the `cluster.gc-percent-not-ready` setting lets you trade CPU for memory usage during startup.

Backfilling will go as fast as it can until it reaches a bottleneck (Kafka brokers, CPU constraints, etc), so your numbers may vary.

