Jepsen tests for the etcd consensus system! To run these, you'll need a Jepsen cluster.
To run a full test suite, run
lein run test-all --concurrency 2n --rate 1000
You can focus in on particular combinations of faults, or particular workloads:
lein run test-all --concurrency 2n --workload append lein run test-all --concurrency 2n --nemesis kill,partition
You can also run particular workloads and nemeses with
lein run test. For instance, to demonstrate lost updates to a set protected by an etcd lock, try:
lein run test --workload lock-set --nemesis pause --time-limit 120
append appends values to keys and reads them back, using transactions to
perform multiple reads/appends atomically. Each append is implemented by
reading the current value of the key from etcd, and writing it back using a
compare guard on the observed revision, to ensure nobody has changed it during
lock-set uses an etcd lock to protect access to a shared in-memory set stored
on the local control node, inside Jepsen itself.
lock-etcd-set does the same,
but stores the set in etcd. Both of these are unsafe, because etcd locks are
set appends unique elements to a single key using compare-and-set transactions, and tries to read the set back.
register tests for linearizable reads, writes, and compare-and-set operations
on named registers. Each register is checked independently, since there are no
cross-register transactions. The checker is aware of revision information as
well as the actual value of the register.
watch has processes observe change events on a single key as it goes through
various revisions, then confirms that everyone observed the same state
transitions in the same order. Watch has a tendency to get stuck if we clobber
the cluster too badly, which may result in indefinite results.
partition are the usual Jepsen nemeses for pausing/killing etcd processes, adjusting and strobing the clock, and partitioning the network.
member adds and removes members dynamically.
bitflip-snap induce single-bit errors in the WAL and snapshot files on disk.
--concurrency 4n says "run 4 worker threads per node".
--lazyfs mounts the etcd data directory in a lazyfs FUSE filesystem, and alters the
kill nemesis to also lose un-fsynced writes.
--ops-per-key controls how many operations we perform on a single key.
--nemesis FAULTS takes a comma-separated list of faults to inject.
--nemesis-interval SECONDS sets the approximate time between nemesis operations.
--rate HZ sets the rough number of operations per second Jepsen tries to perform.
--serializable allows etcd to serve stale data.
--snapshot-count N controls the snapshot count parameter to etcd; we lower this to 100 by default to stress the snapshot mechanism.
--tcpdump captures client traffic to each node in a pcap file.
--unsafe-no-fsync tells etcd to not perform fsyncs. This should be unsafe.
--time-limit SECONDS controls how long to run the test for.
Copyright © 2019, 2020, 2022 Jepsen, LLC
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.