This repository has been archived by the owner on Dec 13, 2021. It is now read-only.

Commit

Merge pull request #129 from osrg/suda/wip
Major improvements (doc, CLI)
AkihiroSuda committed Mar 31, 2016
2 parents a7defa0 + cc050d4 commit 938c58c
Showing 15 changed files with 389 additions and 197 deletions.
133 changes: 117 additions & 16 deletions README.md
@@ -5,12 +5,13 @@
[![GoDoc](https://godoc.org/github.com/osrg/earthquake/earthquake?status.svg)](https://godoc.org/github.com/osrg/earthquake/earthquake)
[![Build Status](https://travis-ci.org/osrg/earthquake.svg?branch=master)](https://travis-ci.org/osrg/earthquake)
[![Coverage Status](https://coveralls.io/repos/github/osrg/earthquake/badge.svg?branch=master)](https://coveralls.io/github/osrg/earthquake?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/osrg/earthquake)](https://goreportcard.com/report/github.com/osrg/earthquake)

Earthquake is a programmable fuzzy scheduler for testing real implementations of distributed systems (such as ZooKeeper).

Blog: [http://osrg.github.io/earthquake/](http://osrg.github.io/earthquake/)

Earthquake permutes C/Java function calls, Ethernet packets, Filesystem events, and injected faults in various orders so as to find implementation-level bugs of the distributed system.
Earthquake permutes Java function calls, Ethernet packets, Filesystem events, and injected faults in various orders so as to find implementation-level bugs of the distributed system.
Earthquake can also control non-determinism of the thread interleaving (by calling `sched_setattr(2)` with randomized parameters).
Earthquake can therefore also be used for testing standalone multi-threaded software.

@@ -27,13 +28,13 @@ Basically, Earthquake permutes events in a random order, but you can write your
* Found [YARN-4301](https://issues.apache.org/jira/browse/YARN-4301) (fault tolerance): ([repro code](example/yarn/4301-reproduce))
* Reproduced flaky tests YARN-{[1978](https://issues.apache.org/jira/browse/YARN-1978), [4168](https://issues.apache.org/jira/browse/YARN-4168), [4543](https://issues.apache.org/jira/browse/YARN-4543), [4548](https://issues.apache.org/jira/browse/YARN-4548), [4556](https://issues.apache.org/jira/browse/YARN-4556)} ([repro instruction](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/42))

## Quick Start
The following instructions show how you can start *Earthquake Container*, the simplified CLI for Earthquake.
## Quick Start (Container mode)
The following instructions show how you can start *Earthquake Container*, the simplified, Docker-like CLI for Earthquake.


$ sudo apt-get install libzmq3-dev libnetfilter-queue-dev
$ go get github.com/osrg/earthquake/earthquake-container
$ sudo earthquake-container run -it --rm ubuntu bash
$ sudo earthquake-container run -it --rm -v /foo:/foo ubuntu bash


In *Earthquake Container*, you can run an arbitrary command that might be *flaky*.
@@ -59,6 +60,11 @@ explorePolicy = "random"
# Default: 0 and 0
minInterval = "80ms"
maxInterval = "3000ms"

# for Ethernet/Filesystem inspectors, you can specify fault-injection probability (0.0-1.0).
# Default: 0.0
faultActionProbability = 0.0

# for Process inspector, you can specify how to schedule processes
# "mild": execute processes with randomly prioritized SCHED_NORMAL/SCHED_BATCH scheduler.
# "extreme": pick up some processes and execute them with SCHED_RR scheduler. others are executed with SCHED_BATCH scheduler.
@@ -76,31 +82,125 @@ explorePolicy = "random"
```
For other parameters, please refer to [`config.go`](earthquake/util/config/config.go) and [`randompolicy.go`](earthquake/explorepolicy/random/randompolicy.go).

If you don't want to use containers, you can also use Earthquake (process inspector) with an arbitrary process tree.

## Quick Start (Non-container mode)
If you don't want to use containers, please use the `earthquake` command directly.

$ sudo apt-get install libzmq3-dev libnetfilter-queue-dev
$ go get github.com/osrg/earthquake/earthquake
$ sudo earthquake inspectors proc -root-pid $TARGET_PID -watch-interval 1s -autopilot config.toml

For Ethernet inspector,
### Process inspector

$ iptables -A OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
$ sudo earthquake inspectors ethernet -nfq-number 42 -autopilot config.toml
$ sudo -u johndoe $TARGET_PROGRAM
$ iptables -D OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
$ sudo earthquake inspectors proc -root-pid $TARGET_PID -watch-interval 1s

By default, all the processes and the threads under `$TARGET_PID` are randomly scheduled.

You can also specify a config file by running with `-autopilot config.toml`.

You can also set `-orchestrator-url` and `-entity-id` for distributed execution.

Note that the process inspector may not be effective for reproducing short-running flaky tests, but it is still effective for long-running tests: [issue #125](https://github.com/osrg/earthquake/issues/125).


The guide for reproducing flaky Hadoop tests (please use `earthquake` instead of `microearthquake`): [FOSDEM slide 42](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/42).

For Filesystem inspector,

### Filesystem inspector (FUSE)

$ mkdir /tmp/{eqfs-orig,eqfs}
$ sudo earthquake inspectors fs -original-dir /tmp/eqfs-orig -mount-point /tmp/eqfs -autopilot config.toml
$ sudo earthquake inspectors fs -original-dir /tmp/eqfs-orig -mount-point /tmp/eqfs
$ $TARGET_PROGRAM_WHICH_ACCESSES_TMP_EQFS
$ sudo fusermount -u /tmp/eqfs

For full-stack (fully-distributed) Earthquake environment, please refer to [doc/how-to-setup-env-full.md](doc/how-to-setup-env-full.md).
By default, all the `read`, `mkdir`, and `rmdir` accesses to the files under `/tmp/eqfs` are randomly scheduled.
`/tmp/eqfs-orig` is just used as the backing storage.

You can also inject faults (currently just injects `-EIO`) by setting `explorePolicyParam.faultActionProbability` in the config file.

### Ethernet inspector (Linux netfilter_queue)

$ iptables -A OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42
$ sudo earthquake inspectors ethernet -nfq-number 42
$ sudo -u johndoe $TARGET_PROGRAM
$ iptables -D OUTPUT -p tcp -m owner --uid-owner $(id -u johndoe) -j NFQUEUE --queue-num 42

By default, all the packets for `johndoe` are randomly scheduled (with some optimization for TCP retransmission).

You can also inject faults (currently just drop packets) by setting `explorePolicyParam.faultActionProbability` in the config file.

### Ethernet inspector (Openflow 1.3)

You have to install [ryu](https://github.com/osrg/ryu) and [hookswitch](https://github.com/osrg/hookswitch) for this feature.

$ sudo pip install ryu hookswitch
$ sudo hookswitch-of13 ipc:///tmp/hookswitch-socket --tcp-ports=4242,4243,4244
$ sudo earthquake inspectors ethernet -hookswitch ipc:///tmp/hookswitch-socket

[The slides for the presentation at FOSDEM](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/42) might also be helpful.
Please also refer to [doc/how-to-setup-env-full.md](doc/how-to-setup-env-full.md) for this feature.

### Java inspector (AspectJ, byteman)

To be documented

### Distributed execution

Please follow these examples: [example/zk-found-2212.ryu](example/zk-found-2212.ryu), [example/zk-found-2212.nfqhook](example/zk-found-2212.nfqhook)

#### Step 1
Prepare `config.toml` for distributed execution.
Example:
```toml
# executed in `earthquake init`
init = "init.sh"

# executed in `earthquake run`
run = "run.sh"

# executed in `earthquake run` as the test oracle
validate = "validate.sh"

# executed in `earthquake run` as the clean-up script
clean = "clean.sh"

# REST port for the communication.
# You can also set pbPort for ProtocolBuffers (Java inspector)
restPort = 10080

# of course you can also set explorePolicy here as well
```

#### Step 2
Create a `materials` directory, and put the `*.sh` scripts into it.

#### Step 3
Run `earthquake init --force config.toml materials /tmp/x`.

This command executes `init.sh` for initializing the workspace `/tmp/x`.
`init.sh` can access the `materials` directory as `${EQ_MATERIALS_DIR}`.

#### Step 4
Run `for f in $(seq 1 100);do earthquake run /tmp/x; done`.

This command starts the orchestrator, and executes `run.sh`, `validate.sh`, and `clean.sh` for testing the system (100 times).

`run.sh` should invoke multiple Earthquake inspectors: `earthquake inspectors <proc|fs|ethernet> -entity-id _some_unique_string -orchestrator-url http://127.0.0.1:10080`

`*.sh` can access the `/tmp/x/{00000000, 00000001, 00000002, ..., 00000063}` directory as `${EQ_WORKING_DIR}`, which is intended for putting test results and some relevant information. (Note: 0x63==99)
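
The directory names in the example look like zero-padded, 8-digit hexadecimal run counters. Assuming that reading is right (it is inferred from the example names and the `0x63==99` note, not checked against the source, and lowercase hex digits are also an assumption), the mapping from run index to directory can be sketched as:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// workingDirName formats a zero-based run index as an 8-digit,
// zero-padded hexadecimal string. Hypothetical helper for illustration.
func workingDirName(run int) string {
	return fmt.Sprintf("%08x", run)
}

func main() {
	// First, second, and 100th run under the workspace /tmp/x.
	for _, run := range []int{0, 1, 99} {
		fmt.Println(filepath.Join("/tmp/x", workingDirName(run)))
	}
}
```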

`validate.sh` should exit with zero for successful executions, and with non-zero status for failed executions.

`clean.sh` is an optional clean-up script for each execution.

#### Step 5
Run `earthquake summary /tmp/x` for summarizing the result.

If you have [JaCoCo](http://eclemma.org/jacoco/) coverage data, you can run `java -jar bin/earthquake-analyzer.jar --classes-path /somewhere/classes /tmp/x` for counting execution patterns as in [FOSDEM slide 18](http://www.slideshare.net/AkihiroSuda/tackling-nondeterminism-in-hadoop-testing-and-debugging-distributed-systems-with-earthquake-57866497/18).

![doc/img/exec-pattern.png](doc/img/exec-pattern.png)

## Talks

* [CoreOS Fest](http://sched.co/6Szb) (May 9-10, 2016, Berlin)
* [ApacheCon Core North America](http://events.linuxfoundation.org/events/apachecon-north-america/program/schedule) (May 11-13, 2016, Vancouver)
* [FOSDEM](https://fosdem.org/2016/schedule/event/nondeterminism_in_hadoop/) (January 30-31, 2016, Brussels)
* The poster session of [ACM Symposium on Cloud Computing (SoCC)](http://acmsocc.github.io/2015/) (August 27-29, 2015, Hawaii)
@@ -116,7 +216,8 @@ Released under [Apache License 2.0](LICENSE).

---------------------------------------

## API Overview
## API for your own exploration policy

```go
// implements earthquake/explorepolicy/ExplorePolicy interface
type MyPolicy struct {
Binary file added doc/img/exec-pattern.png
23 changes: 23 additions & 0 deletions earthquake-container/cli/run/run.go
@@ -54,11 +54,34 @@ func prepare(args []string) (dockerOpt *docker.CreateContainerOptions, removeOnE
return
}

func help() string {
// FIXME: why not use the strings in runflag.go?
s := `Usage: earthquake-container run [OPTIONS] IMAGE COMMAND
Run a command in a new Earthquake Container
Docker-compatible options:
-d, --detach [NOT SUPPORTED] Run container in background and print container ID
-i, --interactive Keep STDIN open even if not attached
--name Assign a name to the container
--rm Automatically remove the container when it exits
-t, --tty Allocate a pseudo-TTY
-v, --volume=[] Bind mount a volume
Earthquake-specific options:
-eq-config Earthquake configuration file
NOTE: Unlike docker, COMMAND is mandatory at the moment.
`
return s
}

func Run(args []string) int {
dockerOpt, removeOnExit, eqCfg, err := prepare(args)
if err != nil {
// do not panic here
fmt.Fprintf(os.Stderr, "%s\n", err)
fmt.Fprintf(os.Stderr, "\n%s\n", help())
return 1
}

4 changes: 2 additions & 2 deletions earthquake/cli/init.go
@@ -244,15 +244,15 @@ type initCmd struct {
}

func (cmd initCmd) Help() string {
return "init help (todo)"
return "Please run `earthquake --help run` instead"
}

func (cmd initCmd) Run(args []string) int {
return _init(args)
}

func (cmd initCmd) Synopsis() string {
return "Initialize storage directory"
return "Initialize the workspace for \"run\" command"
}

func initCommandFactory() (cli.Command, error) {
64 changes: 55 additions & 9 deletions earthquake/cli/inspectors.go
@@ -27,16 +27,57 @@ type inspectorsCmd struct {
}

func (cmd inspectorsCmd) Help() string {
// FIXME: much more helpful help string
return `
Earthquake Inspectors
- proc: Process inspector
- fs: Filesystem inspector
- ethernet: Ethernet inspector
NOTE: this binary does NOT include following inspectors:
- Java Inspector: (included in earthquake/inspector/java)
- C Inspector: (included in earthquake/inspector/c)
`
The inspectors command starts an Earthquake inspector.
If -orchestrator-url is set, the inspector connects to the external orchestrator.
For how to start the external orchestrator, please refer to the help of the run command.
(earthquake --help run)
Note that you have to set -entity-id to a unique value if you connect multiple inspectors to the external orchestrator.
If -orchestrator-url is not set, the inspector connects to the embedded orchestrator.
You can specify the configuration file for the embedded orchestrator by setting -autopilot <config.toml>.
Process inspector (proc)
Inspects running Linux process information, and sets scheduling attributes.
Typical usage: earthquake inspectors proc -root-pid 42 -watch-interval 1s
Event signals: ProcSetEvent
Action signals: ProcSetSchedAction
Filesystem inspector (fs)
Inspects file access information, and injects delays and faults.
Implemented in FUSE.
Typical usage: earthquake inspectors fs -original-dir /tmp/eqfs-orig -mount-point /tmp/eqfs
Event signals: FilesystemEvent
Action signals: EventAcceptanceAction, FilesystemFaultAction
Ethernet inspector (ethernet)
Inspects Ethernet packet information, and injects delays and faults.
Implemented in Linux netfilter / Openflow.
For Openflow implementation, you have to install hookswitch: https://github.com/osrg/hookswitch
Typical usage: earthquake inspectors ethernet -nfq-number 42
Event signals: PacketEvent
Action signals: EventAcceptanceAction, PacketFaultAction
NOTE: this binary does NOT include the following inspectors:
Java Inspector: (included in misc/inspector/java)
C Inspector: (included in misc/inspector/c, NOT MAINTAINED)
NOTE: Python implementation for Ethernet inspector is also available in misc/pyearthquake.
You can also implement your own inspector in an arbitrary language.
`
}

func (cmd inspectorsCmd) Run(args []string) int {
@@ -47,6 +88,11 @@ func (cmd inspectorsCmd) Run(args []string) int {
"fs": inspectors.FsCommandFactory,
"ethernet": inspectors.EtherCommandFactory,
}
c.HelpFunc = func(commands map[string]mcli.CommandFactory) string {
s := (mcli.BasicHelpFunc("earthquake inspectors"))(commands)
s += cmd.Help()
return s
}

exitStatus, err := c.Run()
if err != nil {