grafana · Dieterbe · Feb 12, 2019
diff --git a/api/routes.go b/api/routes.go
@@ -68,8 +68,8 @@ func (s *Server) RegisterRoutes() {
 	r.Combo("/tags/autoComplete/tags", withOrg, ready, bind(models.GraphiteAutoCompleteTags{})).Get(s.graphiteAutoCompleteTags).Post(s.graphiteAutoCompleteTags)
 	r.Combo("/tags/autoComplete/values", withOrg, ready, bind(models.GraphiteAutoCompleteTagValues{})).Get(s.graphiteAutoCompleteTagValues).Post(s.graphiteAutoCompleteTagValues)
 	r.Post("/tags/delSeries", withOrg, ready, bind(models.GraphiteTagDelSeries{}), s.graphiteTagDelSeries)
-	r.Combo("/functions", withOrg, ready).Get(s.graphiteFunctions).Post(s.graphiteFunctions)
-	r.Combo("/functions/:func(.+)", withOrg, ready).Get(s.graphiteFunctions).Post(s.graphiteFunctions)
+	r.Combo("/functions", withOrg).Get(s.graphiteFunctions).Post(s.graphiteFunctions)
+	r.Combo("/functions/:func(.+)", withOrg).Get(s.graphiteFunctions).Post(s.graphiteFunctions)
 
 	// Prometheus endpoints
 	r.Combo("/prometheus/api/v1/query_range", cBody, withOrg, ready, form(models.PrometheusRangeQuery{})).Get(s.prometheusQueryRange).Post(s.prometheusQueryRange)

diff --git a/docs/clustering.md b/docs/clustering.md
@@ -83,6 +83,41 @@ partitions | 0,1 | 0,2 | 1,3 | 2,3 |
 This would offer better load balancing should node A fail (B and C will each take over a portion of the load), but will require making primary status a per-partition concept.
 Hence, this is currently **not supported**.
 
+## Priority and ready state
+
+Priority is a measure of how in-sync a metrictank process is with its input, expressed in seconds. Lower values mean the process is more in sync.
+
+| input plugin  | priority                 |
+| ------------- | ------------------------ |
+| carbon-in     | 0                        |
+| prometheus-in | 0                        |
+| kafka-mdm-in  | estimate of consumer lag |
+
+When the input plugin cannot determine its lag, or has not started yet, priority is 10000 (about 2.8 hours).
+
+* The priority value is gossiped to all peers in a sharding cluster.
+* To satisfy queries from users, requests are fanned out across all ready instances in the cluster (see below), favoring lower priority instances.
+* The priority can be inspected via HTTP endpoints like `/cluster`, `/priority`, `/node` or via the dashboard/metrics.
+
+Readiness or "ready state":
+
+(whenever we say "ready" or "ready state", we mean the value that takes priority into account, not the internal NodeState, as explained below)
+
+* Indicates whether an instance is considered ready to satisfy data requests.
+* An instance that is not ready refuses data and index requests.
+* Can be checked via the `/` HTTP endpoint. [more info](http-api.md#get-app-status)
+* The GC percentage used while an instance is not ready can be controlled via the `cluster.gc-percent-not-ready` setting.
+
+A node is ready when all of the following are true:
+* priority does not exceed the `cluster.max-priority` setting, which defaults to 10.
+* its internal NodeState is ready, which happens:
+  * for primary nodes, immediately after startup (loading index, starting input plugins, etc)
+  * for secondary nodes, once "warm-up-period" has elapsed after startup.
+
+Special cases:
+* The `/node` and `/cluster` endpoints show the internal state of the node, including the internal NodeState.
+* What is gossiped across the cluster is also the full internal node state (including NodeState, priority, etc).
+* The `cluster.self.state.ready.gauge1` metric also reports the internal NodeState, whereas the `cluster.total.state` metrics use the normal ready state.
 
 ## Caveats
 
@@ -96,5 +131,4 @@ Enable the `create-keyspace` on only one node, or leave it enabled for all, but
 
 ## Other
 - use min-available-shards to control how many unavailable shards are tolerable
-- use max-priority to control how much priority / data log is tolerable (note: lowest lag shards are preferred)
 - note: currently if a shard fails, it doesn't retry other instance in the same request
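
The "Priority and ready state" section added to docs/clustering.md above boils down to two rules: a node is ready when its internal NodeState is ready and its priority does not exceed `cluster.max-priority`, and queries favor ready nodes with lower priority. A minimal Go sketch of those two rules follows. All type and function names here (`Peer`, `isReady`, `pickPeer`) are hypothetical and chosen for illustration only; they are not taken from the metrictank code base, and the real peer selection may differ (for example, it works per partition).

```go
package main

import "fmt"

// NodeState stands in for the "internal NodeState" mentioned in the docs.
// The type and constant names are assumptions made for this sketch.
type NodeState int

const (
	NodeNotReady NodeState = iota
	NodeReady
)

// Peer is a hypothetical view of one cluster member as learned via gossip.
type Peer struct {
	Name     string
	State    NodeState
	Priority int // estimated lag in seconds; 10000 when unknown
}

// isReady applies the documented rule: the internal NodeState must be ready
// and the priority must not exceed maxPriority (cluster.max-priority).
func isReady(p Peer, maxPriority int) bool {
	return p.State == NodeReady && p.Priority <= maxPriority
}

// pickPeer returns the ready peer with the lowest priority.
// The boolean is false when no peer is ready.
func pickPeer(peers []Peer, maxPriority int) (Peer, bool) {
	var best Peer
	found := false
	for _, p := range peers {
		if !isReady(p, maxPriority) {
			continue
		}
		if !found || p.Priority < best.Priority {
			best, found = p, true
		}
	}
	return best, found
}

func main() {
	peers := []Peer{
		{Name: "a", State: NodeReady, Priority: 10000}, // lag unknown: not ready
		{Name: "b", State: NodeReady, Priority: 3},     // ready, small lag
		{Name: "c", State: NodeNotReady, Priority: 0},  // still warming up
	}
	if p, ok := pickPeer(peers, 10); ok {
		fmt.Println("query goes to", p.Name) // prints "query goes to b"
	}
}
```

With the default `cluster.max-priority` of 10, peer `a` is excluded because its priority is still at the 10000 placeholder, and peer `c` is excluded because its internal NodeState is not ready, even though its priority is 0.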
diff --git a/docs/http-api.md b/docs/http-api.md
@@ -15,8 +15,8 @@ POST /
 
 returns:
 
-* `200 OK` if the node is primary or a warmed up secondary (`warmupPeriod` has elapsed)
-* `503 Service not ready` if the node is secondary and not yet warmed up.
+* `200 OK` if the node is [ready](clustering.md#priority-and-ready-state)
+* `503 Service not ready` otherwise.
 
 #### Example
 
@@ -132,7 +132,7 @@ returns a json document with the following fields:
 * "primary": whether the node is a primary node or not
 * "primaryChange": timestamp of when the primary state last changed
 * "version": metrictank version
-* "state": whether the node is ready to handle requests or not
+* "state": whether the node is [ready](clustering.md#priority-and-ready-state) to handle requests or not
 * "stateChange": timestamp of when the state last changed
 * "started": timestamp of when the node started up
 

diff --git a/docs/startup.md b/docs/startup.md
@@ -22,11 +22,12 @@ The full startup procedure has many details, but here we cover the main steps if
 | init Index              | creates session, keyspace, tables, write queues, etc and loads in-memory index from persisted data | reasonable RAM and CPU increase                    |
 | create cluster notifier | optional: connects to Kafka, starts backfilling persistence message and waits until done or timeout| if backfilling: above-normal CPU, normal RAM usage |
 | start input plugin(s)   | starts backfill (kafka) or listening (carbon, prometheus) and maintain priority based on input lag | if backfilling: above-normal CPU and RAM usage     |
-| mark ready state        | immediately (primary) or after warmup period (secondary) (combined with priority for clustering)   | no                                                 |
+| mark ready state        | immediately (primary) / after warmup (secondary) [details](clustering.md#priority-and-ready-state) | no                                                 |
 
 We recommend provisioning a cluster such that it can backfill a 7 hour backlog in half on hour or less. This means:
 * The CPU increase during the kafka backfilling is very significant: typically a 14x cpu increase compared to normal usage.
-* The RAM usage during the input data backfilling is typically about 1.5x to 2x normal.
+* The RAM usage during the input data backfilling is typically about 1.5x to 2x normal,
+  though the `cluster.gc-percent-not-ready` setting lets you trade cpu for memory usage during startup.
 
 Backfilling will go as fast as it can until it reaches a bottleneck (kafka brokers, cpu constraints, etc), so your numbers may vary.