Commit 7f79b18

better document priority and ready state

Dieterbe committed Feb 12, 2019
1 parent 544f48f commit 7f79b18
Showing 3 changed files with 39 additions and 6 deletions.
34 changes: 33 additions & 1 deletion docs/clustering.md
@@ -83,6 +83,39 @@ partitions | 0,1 | 0,2 | 1,3 | 2,3 |
This would offer better load balancing should node A fail (B and C will each take over a portion of the load), but will require making primary status a per-partition concept.
Hence, this is currently **not supported**.

## Priority and ready state

Priority is a measure of how in-sync a metrictank instance is with its input data: the lower the value, the more up to date the instance.

| input plugin | priority |
| ------------- | --------------------------------------------------------------- |
| carbon-in | 0 |
| prometheus-in | 0 |
| kafka-mdm-in | estimate of consumer lag in seconds. 10k if unknown (2.8 hours) |
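
To make the `kafka-mdm-in` row concrete, here is a minimal sketch of how a lag-in-seconds estimate can be derived from offsets. All names here are hypothetical illustrations, not metrictank's actual code, which lives in its kafka-mdm input plugin.

```go
package main

import "fmt"

// unknownPriority is the documented fallback when lag cannot be
// estimated: 10000 seconds, roughly 2.8 hours.
const unknownPriority = 10000

// lagSeconds estimates how far behind a consumer is on one partition:
//   newest:  latest offset available on the broker
//   current: offset the consumer has processed up to
//   rate:    recently observed ingest rate, in offsets per second
func lagSeconds(newest, current int64, rate float64) int {
	if rate <= 0 || newest < current {
		return unknownPriority
	}
	return int(float64(newest-current) / rate)
}

func main() {
	// 50000 offsets behind at 10000 offsets/s -> 5 seconds of lag
	fmt.Println(lagSeconds(1050000, 1000000, 10000))
	// no rate observed yet -> fall back to 10000
	fmt.Println(lagSeconds(1050000, 1000000, 0))
}
```

A node consuming multiple partitions would presumably report the worst (highest) such lag as its priority, so one slow partition marks the whole instance as behind.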

* the priority value is gossiped to all peers in a sharding cluster
* to satisfy user queries, requests are fanned out across all ready instances in the cluster (see below), and lower-priority (more in-sync) instances are favored, as sketched below
* priority can be inspected via http endpoints such as `/cluster`, `/priority` and `/node`, or via the dashboard/metrics
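
As an illustration of that fan-out rule, here is a sketch of picking the instance to query for one shard (hypothetical types and names, not metrictank's actual code): only ready peers are candidates, and among those, the lowest priority wins.

```go
package main

import "fmt"

// peer is a hypothetical view of what gossip tells us about an instance.
type peer struct {
	name     string
	ready    bool
	priority int // lower = more in-sync
}

// pickPeer returns the most in-sync ready peer for a shard,
// or false if no ready peer serves it.
func pickPeer(peers []peer) (peer, bool) {
	var best peer
	found := false
	for _, p := range peers {
		if !p.ready {
			continue
		}
		if !found || p.priority < best.priority {
			best, found = p, true
		}
	}
	return best, found
}

func main() {
	candidates := []peer{
		{"mt-a", true, 3},  // ready, 3s behind
		{"mt-b", false, 0}, // not ready: never selected
		{"mt-c", true, 8},  // ready, but more lag than mt-a
	}
	if p, ok := pickPeer(candidates); ok {
		fmt.Println("querying", p.name) // querying mt-a
	}
}
```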

Readiness, or "ready state":

(whenever we say "ready" or "ready state", we mean the value that takes priority into account, not the internal NodeState; see below)

* indicates whether an instance is considered ready to satisfy data requests
* an instance that is not ready refuses data and index requests
* can be checked via the `/` http endpoint (see the probe sketch below): https://github.com/grafana/metrictank/blob/master/docs/http-api.md#get-app-status
* the GC behavior of a not-ready node can be tuned via the `cluster.gc-percent-not-ready` setting
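
For example, a health check or load balancer can key off that `/` endpoint. A minimal probe sketch follows; the endpoint and status codes are as documented above, while the port is an assumption to adjust for your deployment.

```go
package main

import (
	"fmt"
	"net/http"
)

// isReady reports whether a metrictank instance answers its `/`
// app-status endpoint with 200 (ready) rather than 503 (not ready).
func isReady(baseURL string) (bool, error) {
	resp, err := http.Get(baseURL + "/")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ready, err := isReady("http://localhost:6060") // assumed port
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("ready:", ready)
}
```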

A node is ready when all of the following are true (see the sketch after this list):
* its priority does not exceed the `cluster.max-priority` setting, which defaults to 10.
* its internal NodeState is ready, which happens:
  * for primary nodes, immediately after startup completes (loading the index, starting input plugins, etc)
  * for secondary nodes, once the warm-up period (the `warm-up-period` setting) has elapsed after startup.
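
Putting both conditions together, the effective ready state boils down to the following sketch (not metrictank's actual code):

```go
package main

import "fmt"

// ready is the value that peers and the `/` endpoint act on:
// the internal NodeState must be ready AND the priority must not
// exceed the cluster.max-priority setting (default 10).
func ready(nodeStateReady bool, priority, maxPriority int) bool {
	return nodeStateReady && priority <= maxPriority
}

func main() {
	fmt.Println(ready(true, 3, 10))     // warmed up, 3s behind: true
	fmt.Println(ready(true, 10000, 10)) // lag unknown (10k): false
	fmt.Println(ready(false, 0, 10))    // still warming up: false
}
```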

Special cases:
* the `/node` and `/cluster` endpoints show the internal state of the node, including the internal NodeState.
* what is gossiped across the cluster is also the full internal node state (including NodeState, priority, etc)
* the `cluster.self.state.ready.gauge1` metric also reflects the internal NodeState, whereas the `cluster.total.state` metrics use the normal ready state.

## Caveats

@@ -96,5 +129,4 @@ Enable the `create-keyspace` on only one node, or leave it enabled for all, but

## Other
- use min-available-shards to control how many unavailable shards are tolerable (see the config sketch below)
- use max-priority to control how much priority / data lag is tolerable (note: lowest-lag shards are preferred)
- note: currently, if a shard fails, metrictank doesn't retry other instances in the same request
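
For instance, a config fragment touching these settings might look as follows. This is an ini-style sketch; the values are illustrative, not recommendations, and the exact keys and defaults should be checked against your sample config.

```ini
[cluster]
# minimum number of shards that must be available for a query
# to be served (illustrative value)
min-available-shards = 1
# instances whose priority (lag estimate, in seconds) exceeds this
# are considered not ready; 10 is the documented default
max-priority = 10
```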
6 changes: 3 additions & 3 deletions docs/http-api.md
@@ -15,8 +15,8 @@ POST /

returns:

- * `200 OK` if the node is primary or a warmed up secondary (`warmupPeriod` has elapsed)
- * `503 Service not ready` if the node is secondary and not yet warmed up.
+ * `200 OK` if the node is [ready](clustering.md#priority-and-ready-state)
+ * `503 Service not ready` otherwise.

#### Example

@@ -132,7 +132,7 @@ returns a json document with the following fields:
* "primary": whether the node is a primary node or not
* "primaryChange": timestamp of when the primary state last changed
* "version": metrictank version
* "state": whether the node is ready to handle requests or not
* "state": whether the node is [ready](clustering.md#priority-and-ready-state) to handle requests or not
* "stateChange": timestamp of when the state last changed
* "started": timestamp of when the node started up

5 changes: 3 additions & 2 deletions docs/startup.md
@@ -22,11 +22,12 @@ The full startup procedure has many details, but here we cover the main steps if
| init Index | creates session, keyspace, tables, write queues, etc and loads in-memory index from persisted data | reasonable RAM and CPU increase |
| create cluster notifier | optional: connects to Kafka, starts backfilling persistence message and waits until done or timeout| if backfilling: above-normal CPU, normal RAM usage |
| start input plugin(s) | starts backfill (kafka) or listening (carbon, prometheus) and maintain priority based on input lag | if backfilling: above-normal CPU and RAM usage |
- | mark ready state | immediately (primary) or after warmup period (secondary) (combined with priority for clustering) | no |
+ | mark ready state | immediately (primary) / after warmup (secondary) [details](clustering.md#priority-and-ready-state) | no |

We recommend provisioning a cluster such that it can backfill a 7-hour backlog in half an hour or less. This means:
* The CPU increase during the kafka backfilling is very significant: consuming 7 hours of data in half an hour requires about 14x the normal ingest rate, so typically a 14x CPU increase compared to normal usage.
- * The RAM usage during the input data backfilling is typically about 1.5x to 2x normal.
+ * The RAM usage during the input data backfilling is typically about 1.5x to 2x normal,
+ though the `cluster.gc-percent-not-ready` setting lets you trade CPU for memory usage during startup.

Backfilling will go as fast as it can until it reaches a bottleneck (Kafka brokers, CPU constraints, etc), so your numbers may vary.

