
Submission for Netflix OSS prize by:

justinsb, founder of FathomDB, USA

Which Categories Best Fit Your Submission and Why?

  • Best new monkey

I've added support for different types of chaos monkey beyond just shutting down the instance. I then implemented 14 new species of chaos monkey, simulating different types of failures. I've also made it really easy to create more chaos monkey species just by writing a script, which is run via JClouds/SSH.

Also:

  • Best usability enhancement

I've fixed the problems I encountered while setting up the Simian Army, including documenting how to use the new AWS CLIs.

  • Best new feature

I think that the idea of running sneakier chaos monkeys to simulate more unusual failures might qualify as a new feature.

  • Best contribution to operational tools, availability and manageability

I think that if Netflix can survive an onslaught by the army of the 15 monkeys of chaos, then it is much more likely to remain available throughout non-simulated problems.

Describe your Submission

There are more things in heaven and earth, Horatio, / Than are dreamt of in your philosophy.

Shakespeare, Hamlet

There are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don't know. But there are also unknown unknowns – there are things we do not know we don't know.

Donald Rumsfeld, Press Briefing

The chaos monkey currently just shuts down a machine; it simulates a "clean failure". The chaos monkey was a huge leap forward by Netflix engineering, but we've seen in a few AWS outages now that large-scale AWS failures are not clean instance shutdowns. This is good - things fail in new ways at AWS cloud scale; we'd be upset if the same things kept going wrong. But we need the chaos monkey to simulate more types of failures if we want to more accurately simulate AWS failures. We've seen network connectivity issues; we've seen EBS issues; we've seen S3 go offline. I think the only thing we can be sure of is that the next failure will be different!

We'd ideally like to simulate the "unknown unknowns". That isn't possible per se, but we can instead simulate a large variety of failure types. Code that is robust to a large variety of failures (ideally without specifically addressing them) will be much more likely to survive unknown unknowns.

So now, when the chaos monkey selects a victim instance, instead of just shutting it down, I've changed it so that it randomly picks a mischievous monkey from a barrel of chaos monkeys. I've added some truly evil species:

  • Simius Quies: Block all networking on the instance using EC2 security groups (network failure)
  • Simius Desertus: Null-route the 10.0.0.0 network (failure of the internal EC2 network)
  • Simius Perditus: Cause packet loss (degradation of EC2 network)
  • Simius Tardus: Add packet latency (degradation of EC2 network)
  • Simius Politicus: Cause packet corruption (degradation of EC2 network)
  • Simius Nonomenius: Block all DNS traffic (failure of DNS servers)
  • Simius Amnesius: Null-route S3 traffic (S3 failure)
  • Simius Noneccius: Null-route EC2 API traffic (EC2 control plane failure)
  • Simius Nodynamus: Null-route DynamoDB traffic (DynamoDB failure)
  • Simius Amputa: Disconnect all EBS volumes (EBS failure)
  • Simius Cogitarius: Burn CPU (noisy-neighbor / CPU issue)
  • Simius Occupatus: Burn disk I/O (noisy-neighbor / disk issue)
  • Simius Plenus: Fill up all available space on the root disk (disk full error or disk failure)
  • Simius Delirius: Kill all Java and Python processes on the machine repeatedly (general application fault)
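
The selection step described above can be sketched roughly as follows. This is a minimal Python illustration, not the SimianArmy code (which is Java). The `Species` type, the `BARREL` tuple and the `pick_species` function are hypothetical names. The shell commands use standard Linux tools (`ip route` and `tc` with `netem`), but the `eth0` interface, the rates and the `10.0.0.0/8` prefix are assumptions rather than values taken from the submission.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Species:
    """One kind of chaos monkey (hypothetical type, for illustration only)."""
    name: str
    failure: str
    command: str  # shell command that would be run as root on the victim


# A small barrel. Interface name, rates and the /8 prefix are assumptions.
BARREL = (
    Species("Simius Desertus", "internal network failure",
            "ip route add blackhole 10.0.0.0/8"),
    Species("Simius Perditus", "packet loss",
            "tc qdisc add dev eth0 root netem loss 7%"),
    Species("Simius Tardus", "packet latency",
            "tc qdisc add dev eth0 root netem delay 1000ms"),
    Species("Simius Politicus", "packet corruption",
            "tc qdisc add dev eth0 root netem corrupt 5%"),
)


def pick_species(barrel: Sequence[Species],
                 rng: Optional[random.Random] = None) -> Species:
    """Pick one species uniformly at random from the barrel."""
    if not barrel:
        raise ValueError("the barrel is empty")
    if rng is None:
        rng = random.Random()
    return rng.choice(barrel)


if __name__ == "__main__":
    chosen = pick_species(BARREL)
    # Only print the command; actually running it would break the machine.
    print(f"{chosen.name} ({chosen.failure}): {chosen.command}")
```

Passing a seeded `random.Random` makes the choice repeatable, which is handy when reproducing a particular failure in a test environment.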

These 14 new monkeys generally defeat simple Auto-Scaling-Group rules already, because the instance typically keeps running. I know that Netflix will have more sophisticated rules, but I feel pretty confident that at least one of these chaos monkeys will defeat Netflix's current rules, and that hopefully fixing this will mean I can keep enjoying Breaking Bad and House of Cards next time AWS has a problem!

Creating a new species of chaos monkey involves writing Java code (e.g. to use the EC2 APIs), but I have also made it simple to create a chaos monkey that runs entirely on the instance: it can now just be a shell script (with a trivial wrapper class). To do that, I added JClouds integration, which has excellent SSH support built in. The shell script is uploaded over SSH to the victim machine and then executed. I think that "simplicity is a feature": it's now so easy to create new chaos monkeys that I hope many more will be created, by Netflix, by me and by others.
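
The upload-then-execute flow can be sketched in a few lines. The submission does this in Java through JClouds; the sketch below swaps in the system `scp` and `ssh` clients, driven from Python, purely to show the two steps. The function names, the `root` user, the `/tmp/chaos.sh` remote path and the example host and script names are all assumptions.

```python
import shlex
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(argv: Sequence[str]) -> None:
    """Default runner: execute the command and fail on a non-zero exit."""
    subprocess.run(list(argv), check=True)


def build_commands(host: str, local_script: str,
                   remote_path: str = "/tmp/chaos.sh",
                   user: str = "root") -> List[List[str]]:
    """Return the scp and ssh command lines for one victim instance."""
    target = f"{user}@{host}"
    return [
        # Step 1: upload the script to the victim.
        ["scp", local_script, f"{target}:{remote_path}"],
        # Step 2: execute the uploaded script there with bash.
        ["ssh", target, f"bash {shlex.quote(remote_path)}"],
    ]


def unleash(host: str, local_script: str, runner: Runner = _run) -> None:
    """Upload a chaos script to the victim and run it."""
    for argv in build_commands(host, local_script):
        runner(argv)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in build_commands("victim.example", "burncpu.sh"):
        print(" ".join(argv))
```

Keeping the runner injectable means the same code can do a dry run (as in the `__main__` block) or be exercised in tests without touching a real machine.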

For example, I'd like to add a strategy that null-routes Eureka-based services, but I'm not sure of the exact DNS names to block. For Netflix, though, adding that is now just a simple shell script away.

Finally, to help other people get going with Netflix's Simian Army, I cleaned up a few gotchas that I hit. The getting-started wiki used the old AWS command line toolset, so I added the shiny new AWS CLI commands. There was an issue with strict parsing of boolean configuration values, so I fixed that after it bit me. The wiki instructions had a very complicated procedure for creating the required SimpleDB domain (it involved curl, openssl hashing for the signature, and supposedly didn't work reliably!), so I added code to just auto-create the SimpleDB domain. And finally, the out-of-the-box config tried to send email to foo@bar.com; I added code so that we detect that email address and ignore it. These improvements have been merged into the official Netflix repo (and the doc fixes are already on the wiki).
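
Two of these fixes are easy to picture in code. The text above does not say how the boolean parsing was actually changed, so the following is only one plausible lenient approach, written in Python rather than the project's Java. The accepted words, the function names and the `ops@example.org` address are assumptions; only the `foo@bar.com` placeholder comes from the text.

```python
TRUE_WORDS = {"true", "yes", "on", "1"}
FALSE_WORDS = {"false", "no", "off", "0"}

# Placeholder address shipped in the default config, per the text above.
PLACEHOLDER_EMAILS = {"foo@bar.com"}


def parse_bool(raw: str, default: bool = False) -> bool:
    """Parse a config value leniently: ignore case and surrounding spaces."""
    text = raw.strip().lower()
    if text in TRUE_WORDS:
        return True
    if text in FALSE_WORDS:
        return False
    if text == "":
        return default
    # Fail loudly on anything else rather than silently guessing.
    raise ValueError(f"not a boolean: {raw!r}")


def should_send_email(address: str) -> bool:
    """Skip empty addresses and the out-of-the-box placeholder."""
    cleaned = address.strip().lower()
    return cleaned != "" and cleaned not in PLACEHOLDER_EMAILS


if __name__ == "__main__":
    print(parse_bool(" TRUE "))
    print(should_send_email("foo@bar.com"))
```

Rejecting unrecognized values with an error, instead of quietly treating them as false, keeps a typo in the config from silently disabling a monkey.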

All code is Apache licensed. It has all been submitted via Pull Requests to the Netflix SimianArmy project; some PRs have already been accepted and merged into the official Netflix project in use at Netflix. The code passes the test suite, including Checkstyle checks, PMD checks and unit tests, and so follows the existing Netflix code style guidelines. Documentation has been added to the wiki, and existing installation documentation has been improved.

In terms of the categories:

  • Best new monkey

For adding support for multiple chaos monkeys; adding a barrel of 14 new chaos monkeys; integrating with JClouds and making adding a chaos monkey as easy as writing a script.

  • Best usability enhancement

For fixing the installation documentation and gotchas to let others get the Simian Army going quickly.

  • Best new feature

For recognizing that failures are not always clean, adding support for a variety of failures, and making it easy to add more.

  • Best contribution to operational tools, availability and manageability

For enhancing the chaos monkey to simulate a whole range of new failures, meaning that a huge variety of real world failures can be simulated, so that Netflix is more likely to remain available during the next incident.

Provide Links to GitHub Repos for your Submission

My Simian Army fork: https://github.com/justinsb/SimianArmy

Some changes are already merged upstream; all changes are submitted as PRs: https://github.com/Netflix/SimianArmy

Wiki page with updated instructions (new CLI, auto-create SimpleDB domain): https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide

Wiki page with documentation on the new army of chaos monkeys: https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army