Jepsen testing Onyx. Work in progress
We wrote a blog post describing our experience using Jepsen: Onyx Straps in For a Jepsening
Set the Onyx dependency versions for the peers in project.clj. Snapshot versions are acceptable, but be sure to lein install them before running your tests; otherwise you may end up downloading a different snapshot jar from Clojars.
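To guard against picking up a remote snapshot, a small pre-flight check can confirm that a jar for the expected coordinates is present in the local Maven repository, which is where lein install places artifacts. This is only a sketch: the helper name snapshot_installed, the group path org/onyxplatform and the version 0.8.0-SNAPSHOT are illustrative assumptions, so substitute whatever project.clj actually references.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a jar for the given coordinates exists
# in the local Maven repository.
# Usage: snapshot_installed GROUP_PATH ARTIFACT VERSION [REPO_DIR]
snapshot_installed() {
  repo="${4:-$HOME/.m2/repository}"
  [ -f "$repo/$1/$2/$3/$2-$3.jar" ]
}

# Example with assumed coordinates: warn before a test run if the jar is missing.
if ! snapshot_installed org/onyxplatform onyx 0.8.0-SNAPSHOT; then
  echo "onyx snapshot not found locally; run 'lein install' in the onyx checkout first" >&2
fi
```

Note that the check only shows that some jar with that name is present; it does not prove the jar came from a local install rather than an earlier download.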
If not using Linux, install Docker Machine.
Then create a new "machine":
VMware Fusion instructions
Tune disk size, memory size and cpu counts to taste.
docker-machine create --driver vmwarefusion --vmwarefusion-disk-size 50000 --vmwarefusion-memory-size 20000 --vmwarefusion-cpu-count "6" jepsen-onyx
VirtualBox instructions
docker-machine create --driver virtualbox --virtualbox-disk-size 50000 --virtualbox-memory 20000 --virtualbox-cpu-count 4 jepsen-onyx
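Since the sizes are meant to be tuned, it can help to keep them in variables and print the resulting command before running it. The following is a minimal sketch for the VirtualBox driver; the variable names and the create_cmd helper are hypothetical, and the defaults simply mirror the values shown above (disk and memory sizes are in MB).

```shell
#!/bin/sh
# Assumed defaults; override from the environment to suit the host.
DISK_MB="${DISK_MB:-50000}"
MEMORY_MB="${MEMORY_MB:-20000}"
CPUS="${CPUS:-4}"
MACHINE="${MACHINE:-jepsen-onyx}"

# Print the docker-machine command without running it (dry run).
create_cmd() {
  echo "docker-machine create --driver virtualbox" \
       "--virtualbox-disk-size $DISK_MB" \
       "--virtualbox-memory $MEMORY_MB" \
       "--virtualbox-cpu-count $CPUS $MACHINE"
}

create_cmd              # inspect the command first
# eval "$(create_cmd)"  # uncomment to actually create the machine
```

For example, MEMORY_MB=8000 CPUS=2 sh create-machine.sh would print a command for a smaller machine, where create-machine.sh is whatever file the sketch is saved as.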
- Set docker-machine env:
eval "$(docker-machine env jepsen-onyx)"
- Uberjar peers and start docker in docker instance:
- Run from inside docker in docker.
Where TEST_NS is currently either
When running a new test, exit the docker instance and restart the process from step 4. The docker containers have everything set up so that nothing needs to be downloaded or installed before running a test. The Jepsen test does not clean up after itself, so a new container must be started before running each new test.
onyx-jepsen uses a custom jepsen docker image built specifically to test Onyx. This includes pre-installed ZooKeeper. See the README in the docker directory for more details.
The peers run with the following JVM options to avoid resource starvation when everything runs on a single machine:
-D"aeron.threading.mode=SHARED" -server -XX:+UseG1GC
See script/run-peers.sh for settings.
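As a rough illustration of how these options reach the JVM, the sketch below assembles a java command line from them. script/run-peers.sh remains the authoritative source; the jar path and the peer_cmd helper here are hypothetical. In Aeron's SHARED threading mode the media driver's conductor, sender and receiver share a single thread, which keeps the thread count down when several peers share one host.

```shell
#!/bin/sh
# JVM options quoted in this README.
PEER_JVM_OPTS='-Daeron.threading.mode=SHARED -server -XX:+UseG1GC'
# Assumed jar location; the real one is defined by the project's build.
PEER_JAR="${PEER_JAR:-target/peer-standalone.jar}"

# Print the command that would launch a peer with those options.
peer_cmd() {
  echo "java $PEER_JVM_OPTS -jar $PEER_JAR"
}

peer_cmd
```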
Jepsen Memorial Box
A memorial to those bugs destroyed by Jepsen, or at large, so far:
- Peer join race condition #453 Resolved.
- [Peers that crash on component/start will not reboot #437](https://github.com/onyx-platform/onyx/issues/437) Resolved.
- [Ensure peer restarts after ZooKeeper connection loss/errors #423](https://github.com/onyx-platform/onyx/issues/423) Resolved.
- [BookKeeper state log / key filter interaction issue #382](https://github.com/onyx-platform/onyx/issues/382) Known, but theoretical issue, proven to be an issue by jepsen. Resolved.
- [Failed async BookKeeper writes should cause peer to restart #390](https://github.com/onyx-platform/onyx/issues/390) Known issue, but shown to be resolved by jepsen.
- Boot out dead peers when the replica doesn't reflect the cluster state. #526 Kill -9 test
- [Handle case where peer is restored, but all messages fully acked #4](https://github.com/onyx-platform/onyx-bookkeeper/issues/4) Unresolved, low priority.
- [Plugin should wait until producer channel has completely finished #3](https://github.com/onyx-platform/onyx-bookkeeper/issues/3) Resolved.
- Plugins using producer threads must be able to pass exceptions back to task #435 Resolved in onyx-bookkeeper, also fixed in onyx-datomic, onyx-seq, onyx-kafka.
Copyright © 2015 Distributed Masonry LLC
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.