44 changes: 44 additions & 0 deletions nats-concepts/jetstream/README.md
@@ -132,6 +132,50 @@ Rather than defaulting to the maximum, we suggest selecting the best option base

JetStream also allows server administrators to easily mirror streams, for example between different JetStream domains in order to offer disaster recovery. You can also define a stream that 'sources' from one or more other streams.

**Syncing data to disk**

JetStream’s file-based streams persist messages to disk. However, under the default configuration, JetStream does not immediately `fsync` data to disk. Instead, the server `fsync`s its data on a configurable `sync_interval`, which defaults to 2 minutes; data is guaranteed to be `fsync`-ed no later than this interval. This has important consequences for durability:
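For example, a server that should trade some write throughput for a smaller loss window can shorten the interval in its configuration. This is a sketch, not a recommendation: the `30s` value and the `store_dir` path below are illustrative.
```
jetstream {
    store_dir: "/data/jetstream"
    # fsync file-store data every 30 seconds instead of the 2-minute default
    sync_interval: 30s
}
```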

In a non-replicated setup, an OS failure may result in data loss. A client might publish a message and receive an acknowledgment, but the data may not yet be safely stored to disk. As a result, after an OS failure recovery, a server may have lost recently acknowledged messages.

In a replicated setup, a published message is acknowledged after it has been successfully replicated to at least a quorum of servers. However, replication alone is not enough to guarantee the strongest level of durability against multiple systemic failures.
- If multiple servers fail simultaneously, all due to an OS failure, and before their data has been `fsync`-ed, the cluster may fail to recover the most recently acknowledged messages.
- If a failed server lost data locally due to an OS failure, it may, although extremely rarely, rejoin the cluster and form a new majority with nodes that have never received or persisted a given message. The cluster may then proceed with incomplete data, causing acknowledged messages to be lost.

Setting a lower `sync_interval` increases the frequency of disk writes and reduces the window for potential data loss, but at the expense of performance. Additionally, setting `sync_interval: always` ensures servers `fsync` every message before it is acknowledged. This setting, combined with replication across different data centers or availability zones, provides the strongest durability guarantees, but at the slowest performance.

The default settings have been chosen to balance performance and risk of data loss in what we consider to be a typical production deployment scenario across multiple availability zones.

For example, consider a stream with 3 replicas deployed across three separate availability zones. For the stream state to diverge across nodes, all of the following would need to happen:
- One of the 3 servers is already offline, isolated, or partitioned.
- A second server’s OS is killed such that it loses writes of messages that, not yet having been `fsync`-ed, existed only on 2 out of 3 nodes.
- The stream leader that’s part of the above 2 out of 3 nodes needs to go down or become isolated/partitioned.
- The first server of the original partition that didn’t receive the writes recovers from the partition.
- The OS-killed server now returns and comes in contact with the first server but not with the previous stream leader.

In the end, 2 out of 3 nodes will be available, the previous stream leader with the writes will be unavailable, one server will have lost some writes due to the OS kill, and one server will have never seen these writes due to the earlier partition. The last two servers could then form a majority and accept new writes, essentially losing some of the former writes.

Importantly, this is a failure condition in which stream state could diverge, but in a system deployed across multiple availability zones it requires multiple faults to align in precisely the right way.

A potential mitigation for a failure of this kind is to not automatically bring back a server process that was OS-killed until it is known that a majority of the remaining servers have received the new writes. Alternatively, the crashed server can be peer-removed and re-admitted as a new, wiped peer, allowing it to recover over the network from the existing healthy nodes (although this could be expensive depending on the amount of data involved).
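The second mitigation can be sketched with the `nats` CLI as follows. This is an operational sketch under assumptions: the server name `n3` and the data path are hypothetical, and `peer-remove` requires system-account credentials.
```
# Evict the crashed server from its RAFT groups (system account required).
nats server raft peer-remove n3

# Before restarting the server, wipe its JetStream store (path is illustrative)
# so it rejoins as a new peer and catches up from the healthy nodes.
rm -rf /data/jetstream/*
```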

For use cases where minimizing loss is an absolute priority, `sync_interval: always` can of course still be configured, but note that this has a server-wide performance impact that may affect throughput and latency. For production environments, operators should evaluate whether the default is correct for their use case, target environment, costs, and performance requirements.

Alternatively, a hybrid approach can be used: existing clusters keep their default `sync_interval` settings, while a new cluster configured with `sync_interval: always` is added and identified with server tags. The placement of a stream can then be specified so that it stores its data on this higher-durability cluster through the use of [placement tags](streams.md#placement).
```
# Configure a cluster that's dedicated to always sync writes.
server_tags: ["sync:always"]

jetstream {
sync_interval: always
}
```

Create a replicated stream that is explicitly placed on the cluster using `sync_interval: always`, so that only the stream writes that require this level of durability pay its performance cost.
```
nats stream add --replicas 3 --tag sync:always
```

#### De-coupled flow control

JetStream provides decoupled flow control over streams. Flow control is not 'end to end', where the publisher(s) would be limited to publishing no faster than the slowest of all the consumers (i.e. the lowest common denominator) can receive; instead, it happens individually between each client application (publisher or consumer) and the NATS server.
2 changes: 1 addition & 1 deletion running-a-nats-service/configuration/README.md
@@ -364,7 +364,7 @@ jetstream {
| `max_buffered_msgs` | Maximum number of messages JetStream will buffer in memory when falling behind with RAFT or I/O. Used to protect against OOM when there are write bursts to a queue. | 10,000 | 2.11.0 |
| `max_buffered_size` | Maximum number of bytes JetStream will buffer in memory when falling behind with RAFT or I/O. Used to protect against OOM when there are write bursts to a queue. | 128MB | 2.11.0 |
| `request_queue_limit` | Limits the number of API commands JetStream will buffer in memory. When the limit is reached, clients will get error responses rather than a timeout. Lower the value if you want to detect clients flooding JetStream. | 10,000 | 2.11.0 |
| `sync_interval` | Examples: `10s` `1m` `always` - Change the default fsync/sync interval for page cache in the filestore. By default JetStream relies on stream replication in the cluster to guarantee data is available after an OS crash. If you run JetStream without replication or with a replication of just 2 you may want to shorten the fsync/sync interval. - You can force an fsync after each messsage with `always`, this will slow down the throughput to a few hundred msg/s. | 2m | 2.10.0 |
| `sync_interval` | Examples: `10s` `1m` `always` - Change the default fsync/sync interval for page cache in the filestore. By default JetStream relies on stream replication in the cluster to guarantee data is available after an OS crash. If you run JetStream without replication or with a replication of just 2 you may want to shorten the fsync/sync interval. - You can force an fsync after each message with `always`, this will slow down the throughput to a few hundred msg/s. See also the documentation about the interaction between the stream replication factor and syncing data to disk [here](/nats-concepts/jetstream/README.md#persistent-and-consistent-distributed-storage). | 2m | 2.10.0 |
| `strict` | Return errors for invalid JetStream API requests. Some older client APIs may not expect this. Set to `false` for maximum backward compatibility. | `true` | 2.11.0 |
| `unique_tag` | JetStream peers will be placed in servers with tags unique relative to the `unique_tag` prefix. E.g. nodes in a cluster (or supercluster) are tagged `az:1`,`az:1`,`az:2`,`az:2`,`az:3`,`az:3`,`az:3` . Setting `unique_tag=az` will result in a new replica-3 stream being placed across all three availability zones. | (not set) | 2.8.0 |
| `tpm` | Trusted Platform Module [TPM base encryption](#jetstream-tpm-encryption) | `tpm {}` (not set) | 2.11.0 |
Expand Down