# Datacenter Outages

#### 70% of all outages are caused by human error
* Siren noise damaging disk drives
* Technicians shutting down the wrong system

We'll follow with some examples of outages.

## AWS 
* 2011 Apr 21
* Down for >= 3.5 days
    * Took out Reddit, FourSquare, more

### Datacenter Model
* __Regions__ Datacenters
* __Availability Zones__ Different racks within a datacenter
    * Can be configured to replicate data between zones in a region
* __EBS__ Mountable storage "devices", accessible from EC2 instances
    * 1 EBS volume runs inside an AZ
    * Two networks:
        1. Primary used for EC2 and EBS *control plane* traffic
            * __control plane__ CRUD ops on volumes
        1. Secondary used for *overflow*
            * __overflow__ excess traffic from the primary
    * Control information (metadata) replicated across zones (availability)
* EBS volumes replicated for durability
    * Each volume has a primary replica
    * If out of sync or failure, replicas do an aggressive re-replication of data
    
### The story
* 0047: Routine primary network capacity upgrade in a us-east-1 AZ
    * Operator shifted traffic off several primary n/w routers to other primary n/w routers
    * __Critical Error__ *Someone* (*Bob*) moved primary router traffic to a secondary n/w router
    * Secondary n/w routers can't handle primary traffic, so they become overwhelmed
    * Left many EBS volumes without a connection to their replica on the primary n/w
* Team discovers the error and rolls it back
    * __Error 2__ Due to n/w partitioning, many primaries believed they had no replica
        * Began aggressive re-mirroring
        * Flood of re-mirroring used up available n/w capacity
        * Began *looping*: primaries could not verify their replicas for lack of n/w bandwidth, so they kept trying to re-mirror elsewhere, deadlocking the n/w
            * Affected 13% of EBS volumes in the AZ
    * This left no n/w capacity for Control Plane operations
        * Again, control plane is used for CRUD operations on volumes, not actual data transfer
        * Unable to serve "create volume" API requests for EBS
        * Control plane ops have a long time-out; began backing up on the queue
        * Once the thread pool queue filled up, the control plane began rejecting "create volume" requests
        * First customer-facing sign of difficulties
* 0240: Team disables all "Create Volume" API requests
* 0250: Error rates and latencies for EBS APIs start to recover
    * But two problems remained
        * Primaries searching for replicas *still* kept consuming n/w capacity
        * A race condition existed in EBS code
            * Only triggered in high request rates
            * Caused more node failures
* 0530: Error rates and latencies increase *again*
    * Some more background
        * EBS re-mirroring is a negotiation between an EC2 node, an EBS node, and the EBS control plane
            * race condition started causing EBS nodes to fail
            * negotiation rates increased
            * More EBS nodes failed as a result
            * "Brown-out" of EBS API functionalities
                * EBS isn't completely shut down, but large areas of functionality are going off-line
* 0820: Team starts disabling all communications between EBS cluster in affected AZ and EBS control plane
    * Shut down all EBS API operations in AZ
* 1130: Team learns how to prevent EBS servers in AZ from futile re-mirroring
    * AZ slowly recovering
* Customers still getting high error rates for new EBS-backed EC2 instances until 1200
    * A new EBS control plane API had recently been launched
    * Its error rates were being shadowed by the recent troubles
* 1200: No *new* volumes are getting stuck
    * 13% of volumes still stuck
* April 24 1200: All but 1.04% of volumes had been recovered
    * 0.07% of EBS volumes could *not* be recovered
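
The control plane back-up in the story (long time-outs, a full thread pool queue, then rejected requests) can be illustrated with a toy discrete-time model. This is only a sketch under made-up assumptions, not a model of the real EBS control plane: the function name `simulate_control_plane`, the worker count, the queue limit, and the tick counts are all hypothetical.

```python
from collections import deque

def simulate_control_plane(ticks, arrivals_per_tick, workers, queue_limit, service_ticks):
    """Toy model: each request holds a worker for `service_ticks` ticks.

    Returns (completed, rejected). All parameters are hypothetical.
    """
    busy = []          # remaining ticks for each in-flight request
    queue = deque()    # waiting requests, bounded by queue_limit
    completed = rejected = 0
    for _ in range(ticks):
        # Advance in-flight requests; finished ones free their worker.
        busy = [t - 1 for t in busy]
        completed += sum(1 for t in busy if t == 0)
        busy = [t for t in busy if t > 0]
        # New arrivals queue up, or are rejected once the queue is full.
        for _ in range(arrivals_per_tick):
            if len(queue) < queue_limit:
                queue.append(1)
            else:
                rejected += 1
        # Free workers pick up queued requests.
        while queue and len(busy) < workers:
            queue.popleft()
            busy.append(service_ticks)
    return completed, rejected

# Same load, but in the degraded case each request takes 10x longer
# (standing in for control plane ops waiting on a long time-out).
healthy = simulate_control_plane(100, 2, workers=4, queue_limit=10, service_ticks=2)
degraded = simulate_control_plane(100, 2, workers=4, queue_limit=10, service_ticks=20)
print("healthy  (completed, rejected):", healthy)
print("degraded (completed, rejected):", degraded)
```

With fast requests the queue never fills. Once each request ties up a worker for much longer, the queue fills within a few ticks and further requests are rejected, which is the point where the problem becomes visible to callers.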

### Lessons
* Errors often begin with human error
    * Operator moving primary n/w traffic to secondary n/w router
* Lollapalooza effect
    * Many other factors combine to cause a much *larger* situation
    
* Prevention
    * Step-by-step protocol for n/w upgrades
    * Higher capacity in secondary n/w
    * EBS back-off timeout instead of aggressive re-mirroring 
    * Fix the race condition in the code
    * Incentivize users to take advantage of multiple AZs within a region
    * Improve dashboard for displaying state of customer systems

## Facebook
* 2010 Sep 23


### Datacenter Model
* Data stored in a *persistent store*, and *cache*
    * __Persistent store__ many servers
    * __Cache__ many servers running a distributed cache system
* Automated system for verifying configuration data in cache
    * Automatically replaces invalid cache values from persistent store (store)
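
The verify-and-replace behaviour can be sketched as follows. This is a minimal illustration, assuming a dict-like cache and a single config key; `CountingStore`, `read_config`, `is_valid`, and the `max_conns` key are hypothetical names, not Facebook's actual system.

```python
class CountingStore:
    """Stand-in for the persistent store that counts how often it is queried."""

    def __init__(self, data):
        self.data = dict(data)
        self.queries = 0

    def get(self, key):
        self.queries += 1
        return self.data[key]

def read_config(cache, store, key, is_valid):
    """Return the value for `key`, refreshing the cache if its copy is missing or invalid."""
    value = cache.get(key)
    if value is None or not is_valid(value):
        value = store.get(key)   # fall back to the persistent store
        cache[key] = value       # replace the bad cached copy
    return value

def is_valid(value):
    # Hypothetical validation rule for this example only.
    return isinstance(value, int) and value > 0

store = CountingStore({"max_conns": 100})
cache = {"max_conns": -1}        # invalid cached copy

first = read_config(cache, store, "max_conns", is_valid)
second = read_config(cache, store, "max_conns", is_valid)
print(first, second, store.queries)  # 100 100 1
```

As long as the store holds a valid value, one store query repairs the cache. If the store's own value fails validation, every read falls through to the store.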
    
### The Story
* Sep 23: Invalid change saved to store
    * All cache servers saw invalid value
    * Flood of queries to DB cluster
        * Hundreds of thousands of queries per second (QPS)
* Team fixes invalid configuration in store
* But when a cache server gets an error from the DB, it treats the cached value as invalid and deletes the cache entry
    * Error here means failure to respond
    * Cache server sends more queries
    * Query escalation
* Turn off FB website
* Halt traffic to DB cluster
* Slowly allow users back online
* Took until later in day for site to return
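
The feedback loop, and why letting users back in slowly breaks it, can be shown with a toy model. It is a sketch under strong simplifying assumptions: an overloaded DB fails every query in that round, and all names and numbers (`simulate_retry_loop`, `db_capacity`, `admit_per_round`) are hypothetical.

```python
def simulate_retry_loop(clients, db_capacity, rounds, admit_per_round=None):
    """Toy model of the retry feedback loop (all numbers are hypothetical).

    Every admitted client without a valid cache entry queries the DB each round.
    If a round's queries exceed `db_capacity`, the DB is treated as overloaded
    and every query fails, so no cache entry gets filled.
    `admit_per_round` models letting users back in gradually.
    Returns the number of clients with a valid cache entry after each round.
    """
    admitted = clients if admit_per_round is None else 0
    cached = 0
    history = []
    for _ in range(rounds):
        if admit_per_round is not None:
            admitted = min(clients, admitted + admit_per_round)
        queries = admitted - cached
        if queries <= db_capacity:
            cached += queries    # DB keeps up: every miss is filled
        # Otherwise: overloaded, every query errors and entries stay deleted.
        history.append(cached)
    return history

stampede = simulate_retry_loop(clients=1000, db_capacity=100, rounds=20)
gradual = simulate_retry_loop(clients=1000, db_capacity=100, rounds=20, admit_per_round=100)
print("all at once:", stampede[-1], "clients served from cache")
print("gradual:    ", gradual[-1], "clients served from cache")
```

In this model the all-at-once case never recovers, even though the stored configuration is already fixed, because the load alone keeps the DB failing. Admitting clients at a rate the DB can absorb lets the caches refill.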

### Lessons
* New config system design required
* Back off instead of aggressive retry when resource is unavailable.
    * Exponential backoff
        * Used in TCP, 802.11
        * Wait twice as long as last time
            * $t = 2 \cdot t_{prev}$
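
The doubling rule can be written as a small helper. This is a minimal sketch: the function name `backoff_delays` and the optional `cap` parameter are assumptions added for illustration (a cap is a common companion to backoff but is not part of the notes above).

```python
def backoff_delays(base, retries, cap=None):
    """Return the wait before each retry, doubling every time: t = 2 * t_prev.

    `cap` is an optional upper bound on any single wait (an assumption
    added for this example).
    """
    delays = []
    t = base
    for _ in range(retries):
        delays.append(t if cap is None else min(t, cap))
        t = 2 * t
    return delays

print(backoff_delays(base=1, retries=5))          # [1, 2, 4, 8, 16]
print(backoff_delays(base=1, retries=5, cap=10))  # [1, 2, 4, 8, 10]
```

A caller would sleep for each delay in turn before retrying, so a resource that stays unavailable sees retries spread out rather than arriving at a constant rate.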

## The Planet Outage

* 2008 May 31
* 4th largest web hosting company
* Hosted 22k websites


* 1800: Explosion in H1 Houston DC