## Global Snapshots

* You've got apps/services running on many servers
* It's handy to be able to definitively say what the state of the system is at a given moment in time
    * Checkpointing: Return to a given state on failure
    * Garbage collection: Remove objects on a server that have been "orphaned"
    * Detect deadlocks in database systems
    * Figure out if you're done running computations: Think Folding@Home
    
What is a global snapshot then?
* The state of each process + the state of each _communication channel_ in the distributed system
    * Communication channel means getting any messages in transit!

__A dumb solution:__
* Synchronize all clocks
* Have each process report its state at time _t_
* Why doesn't that work?
    * Hard to synchronize clocks (NTP, Cristian's algorithm)
    * Information loss due to clock skew/drift
    * Synchronization error can be reduced, but never eliminated
    * Doesn't capture state of messages in transit
    * Synchronization _isn't_ required - causal relationships are sufficient

Any single event occurring is enough to change __global state__

### Chandy-Lamport Global Snapshot Algorithm

__Problem:__ Record the state for each process and each channel in a system

### Assumptions:
* _N_ processes in system
* Two unidirectional channels for every pair of processes in the system: $P_{i} \rightarrow P_{j}$ and $P_{j} \rightarrow P_{i}$
* Communication channels are FIFO
* No failures in the system (theoretical)
* All messages arrive, and without duplicates.

### Requirements

* Doesn't interfere with system operations
* Each process records its own state
    * App-defined state, or
    * Heap, registers, prog counter, code (core dump)
* State collected in a distributed manner
* Any process can initiate the snapshot

### Algo
1. Initiator $P_{i}$ records its own state
1. $P_{i}$ creates special messages called "Marker" messages
    * These are distinct from app messages and don't interfere with them
1. $P_{i}$ sends out a Marker message on each _outgoing_ channel from itself to every other process
    * It also begins recording incoming messages on each of its _incoming_ channels

As processes receive these Marker messages:

1. If this is the first Marker message the process has seen:
    1. It records its own state
    1. It marks the state of the channel it received the Marker on as "empty"
    1. It sends a Marker message out on every one of its _outgoing_ channels
    1. It begins recording incoming messages on every _other_ incoming channel
        * To be explicit: every incoming channel besides the one it just received the Marker on and marked as "empty".
1. Else, it has already seen a Marker message, so it:
    1. Records the state of the channel this Marker arrived on as _all the messages received on that channel since it began recording_
    1. Stops recording on that channel
    
Algo terminates once:
1. All processes have received a Marker (so each has recorded its own state)
1. All processes have received a Marker on all of their incoming channels (so every channel state is recorded)

Afterwards, a central server may (optionally) collect the recorded states from each process and aggregate them into the full snapshot.
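
The initiator and Marker-handling rules above can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the `Process` class, the `MARKER` sentinel, the `network` dict of FIFO queues, the account balances, and the fixed delivery order are all hypothetical and not part of the algorithm's definition. Each process sends Markers on all of its outgoing channels when it records its state.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel standing in for the Marker message


class Process:
    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance        # app-defined state (assumed: an account balance)
        self.peers = peers            # ids of all other processes
        self.recorded_state = None    # stays None until this process records its state
        self.channel_state = {}       # src pid -> messages recorded on channel src -> self
        self.recording = set()        # incoming channels still being recorded

    def record_and_send_markers(self, network):
        """Record own state, start recording incoming channels, send Markers."""
        self.recorded_state = self.balance
        self.channel_state = {p: [] for p in self.peers}
        self.recording = set(self.peers)
        for p in self.peers:
            network[(self.pid, p)].append(MARKER)

    def send(self, dst, amount, network):
        self.balance -= amount
        network[(self.pid, dst)].append(amount)

    def receive(self, src, network):
        msg = network[(src, self.pid)].popleft()   # FIFO delivery
        if msg == MARKER:
            if self.recorded_state is None:        # first Marker seen
                self.record_and_send_markers(network)
            # Channel src -> self is now final: empty on a first Marker,
            # otherwise whatever was recorded since recording began.
            self.recording.discard(src)
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_state[src].append(msg)


def run_demo():
    ids = [1, 2, 3]
    procs = {i: Process(i, 100, [j for j in ids if j != i]) for i in ids}
    network = {(i, j): deque() for i in ids for j in ids if i != j}

    procs[2].send(1, 20, network)               # 20 is in transit on 2 -> 1
    procs[1].record_and_send_markers(network)   # P1 initiates the snapshot
    # Deliveries in one fixed order, written as (receiver, sender)
    for dst, src in [(1, 2), (2, 1), (3, 1), (3, 2), (2, 3), (1, 2), (1, 3)]:
        procs[dst].receive(src, network)

    # Termination: everyone recorded, and saw a Marker on every incoming channel
    assert all(p.recorded_state is not None and not p.recording
               for p in procs.values())
    states = {i: p.recorded_state for i, p in procs.items()}
    channels = {(src, i): msgs for i, p in procs.items()
                for src, msgs in p.channel_state.items() if msgs}
    return states, channels


states, channels = run_demo()
print(states)    # {1: 100, 2: 80, 3: 100}
print(channels)  # {(2, 1): [20]}
```

The recorded process states sum to 280 and the recorded channel state holds the remaining 20, so the snapshot accounts for all 300 units even though no process stopped to take it.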

__Summary__
1. One process initiates recording, and tells every other process to do the same
1. Each process records messages on its incoming communication channels
    1. It marks the channel it received its first Marker on as "empty"
    1. It stops recording a channel once it receives a Marker on that channel from the far side process

### Cuts

__Cut__: A time frontier, one point per process, that determines whether events and communications are "in the cut" (at or before the frontier) or "out of the cut" (after the frontier)

* A __consistent cut__ includes the send event of every message whose receive event it includes; it may include a send without the matching receive (that message is in transit)
* An __inconsistent cut__ includes a receive event without including the corresponding send event.

__Any run__ of the _Chandy-Lamport Algorithm_ produces a consistent cut

* Let $e_{i}, e_{j}$ be events at processes $P_{i}$ and $P_{j}$
* If $e_{i} \rightarrow e_{j}$, and $e_{j}$ is in the cut, then $e_{i}$ is in the cut

Or in more mathy-proofy terms: 
* If $e_{j}$ occurs before $P_{j}$ records its own state, then it follows that $e_{i}$ occurs before $P_{i}$ records its own state
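
The consistency condition above (if $e_{i} \rightarrow e_{j}$ and $e_{j}$ is in the cut, then $e_{i}$ is in the cut) can be checked mechanically. A minimal sketch, assuming events are plain strings and the happens-before relation is given as a list of direct `(earlier, later)` edges; the function name and the toy run are hypothetical.

```python
def is_consistent_cut(cut, happens_before):
    """cut: set of event names inside the cut.
    happens_before: direct (earlier, later) edges, i.e. per-process order
    plus one edge from each send event to its matching receive event.
    """
    return all(earlier in cut
               for earlier, later in happens_before
               if later in cut)


# Hypothetical run: P1 does a1, then sends a message (event s);
# P2 receives it (event r), then does b1.
edges = [("a1", "s"), ("s", "r"), ("r", "b1")]

print(is_consistent_cut({"a1", "s"}, edges))  # True: send is in, receive is out
print(is_consistent_cut({"a1", "r"}, edges))  # False: receive is in without its send
```

Checking only the direct edges is enough here: if every direct predecessor of an in-cut event is also in the cut, the same holds along any longer causal path.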



## Safety & Liveness

Safety & liveness are "similar opposites": both are _desired properties_ of a system. Wanting them is easy; actually guaranteeing them takes a rigorous proof.

__Safety__: Guarantee that something _bad_ will _never_ happen
* A peace treaty: War will never happen
* Legal system: Innocent people don't go to jail
* Computation: No deadlocks in a distributed system
* Computation: No orphaned objects
* Failure detection: No false-positive failures detected (accuracy)
* Consensus: No two processes decide on different values

__Liveness__: Guarantee that something _good_ will happen, _eventually_
* Athletics: At least one of the athletes in the 100m dash will win gold
* Legal system: A criminal will eventually be jailed
* Computation: A distributed computation will terminate
* Failure detection: Every failure will eventually be detected (completeness)
* Consensus: A decision will eventually be reached
    
It is difficult to provide both in distributed systems; they are often at odds, as with failure detection (completeness vs. accuracy), so you often have to favor one over the other.

In Consensus:
* You can ensure you reach a decision (Liveness), or you can ensure you make the right one (Safety), but guaranteeing both within bounded time is nigh impossible in an asynchronous system.
    
    
### Prove it

__Liveness__

* The Liveness property we want will be $P_{r}$
* Current state is _S_ (Think "global state")
* A state that satisfies $P_{r}$ is _S'_
* If _S_ satisfies $P_{r}$, _or_ if there is a _causal path_ from _S_ to _S'_ w.r.t. $P_{r}$, we can satisfy that liveness property under the current state
* If this is true for all possible states _S_, then we can ensure $P_{r}$ globally

__Safety__

* The Safety property we want will be $P_{r}$
* Current state is _S_, and it satisfies $P_{r}$
* _S'_ is any global state reachable from _S_
* If all values of _S'_ satisfy $P_{r}$, we can ensure $P_{r}$ globally
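
Both arguments above amount to reachability checks over global states. The sketch below follows the notes' definitions on a tiny, explicit state graph; the graph, the state names, and the helper names are hypothetical, and a real system's state space is far too large to enumerate this way.

```python
from collections import deque


def reachable(graph, start):
    """All states reachable from start (including start itself), via BFS."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def holds_safety(graph, start, pr):
    """Safety: every state S' reachable from S satisfies Pr."""
    return all(pr(s) for s in reachable(graph, start))


def holds_liveness(graph, start, pr):
    """Liveness (as defined above): from every reachable state S there is
    a path to some state S' that satisfies Pr (S itself counts)."""
    return all(any(pr(t) for t in reachable(graph, s))
               for s in reachable(graph, start))


# Hypothetical toy system: global state -> possible next global states
graph = {"idle": ["working"], "working": ["idle", "done"], "done": ["done"]}


def no_deadlock(state):
    return state != "deadlock"


def terminated(state):
    return state == "done"


print(holds_safety(graph, "idle", no_deadlock))   # True
print(holds_liveness(graph, "idle", terminated))  # True
```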

* Chandy-Lamport Algo can detect _stable_ global properties
* Since the snapshot is causally correct, a stable property that holds in the snapshot is guaranteed to still hold once the algorithm terminates
* __stable__ Once true, true forever
    * Liveness example:
        * Once a computation has terminated, it stays terminated
    * Safety examples (detecting a violation):
        * Deadlocks: once deadlocked, always deadlocked
        * Orphaned objects: once orphaned, always orphaned
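
As an illustration of checking a stable property against a recorded snapshot, here is a minimal termination check. It assumes each process records a hypothetical `"idle"` or `"busy"` label as its state and that channel states are lists of recorded in-transit messages; none of these names come from the algorithm itself.

```python
def has_terminated(process_states, channel_states):
    """True if every recorded process is idle and every recorded channel is empty."""
    all_idle = all(state == "idle" for state in process_states.values())
    no_messages_in_transit = all(len(msgs) == 0
                                 for msgs in channel_states.values())
    return all_idle and no_messages_in_transit


# Two hypothetical snapshots: (process states, channel states)
snapshot_a = ({"P1": "idle", "P2": "idle"},
              {("P1", "P2"): [], ("P2", "P1"): []})
snapshot_b = ({"P1": "idle", "P2": "idle"},
              {("P1", "P2"): ["work item"], ("P2", "P1"): []})

print(has_terminated(*snapshot_a))  # True
print(has_terminated(*snapshot_b))  # False: an in-transit message could wake P2 up
```

The second case is why channel state is part of the snapshot: process states alone would wrongly report termination.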
        
### Snapshots: In summary

* Don't want the process of taking a snapshot to disrupt normal operation
* The Chandy-Lamport (C-L) Algo can calculate causally correct global snapshots
* A safety property states that a certain bad thing will never happen in your system
* A liveness property states that a certain good thing will eventually happen in your system