# Leader Election

## Examples
* Who is master for reads/writes in a NoSQL cluster?
* Who is the root server in an NTP hierarchy?
* Who provides total ordering for multicasts, i.e. who is the Sequencer?

### What's involved in an Election?
* A single, __non-faulty__ process elected
* Everyone gets told that this is the new leader
* Plus, you need to handle crashes at any point during this process

### System Model
* _N_ processes
* Each process has a unique ID
* Messages will eventually be delivered
* Failures can occur at any time

### Calling an election
* Any process can initiate
* At the end of an election,
    * All processes elect the same leader, or no process has elected a leader (Safety)
    * There _is_ an end to the election, and a leader is elected (Liveness)
    
### Voting
Before the system is initialized, the attribute used for ranking must be specified.
Common examples include IP address, hardware specifications (like processing power), or the most/fewest files stored.

## Ring Election
All processes have a channel to their clockwise successor
* Process _i_ talks to process _(i + 1) mod N_

### Algorithm
* Any process that discovers the old leader has failed initiates the election
    * Sends an Election message containing its own (id:attr)
* Each subsequent process
    * Compares its own attr to the received attr
    * If its own attr is less, forwards the message unchanged
    * If its own attr is greater,
        * If it hasn't already forwarded a message, overwrites the message with its own (id:attr) and forwards it
    * If the received id and attr are equal to this process's own
        * This process is the new leader, because its message travelled the whole ring and every other process agreed as much
        * Sends an "Elected" message around the ring
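
The steps above can be sketched as a small simulation. This is a minimal sketch, not a definitive implementation: it assumes a single initiator and no failures, and the function name `ring_election`, the `attrs` list, and the tie-break on process ID are all assumptions made for illustration.

```python
def ring_election(attrs, initiator):
    """Simulate one failure-free ring election with a single initiator.

    attrs[i] is the ranking attribute of process i, and process i sends
    to process (i + 1) % n. Ties on attr are broken by process ID
    (an assumption), so every (attr, id) pair is unique.
    Returns (leader_id, total_messages_sent).
    """
    n = len(attrs)

    def rank(pid):
        return (attrs[pid], pid)

    candidate = initiator            # ID carried inside the Election message
    receiver = (initiator + 1) % n   # process about to receive the message
    messages = 1                     # the initiator's first send
    while receiver != candidate:
        if rank(receiver) > rank(candidate):
            candidate = receiver     # overwrite the message with own (id:attr)
        receiver = (receiver + 1) % n
        messages += 1                # message forwarded to the successor
    # The candidate received its own (id:attr) back, so it is the leader.
    # The Elected message then travels once around the ring: n more sends.
    messages += n
    return candidate, messages


print(ring_election([3, 9, 4, 1], initiator=2))  # (1, 11)
```

With four processes and the initiator sitting just after the highest-ranked process, the Election message needs almost two full laps before the winner sees its own ID, which is why this run costs 3N - 1 = 11 messages.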

### Multiple Initiator
Can use the process ID as its own ranking.

* Each process caches the initiator of the Election/Elected messages it receives
* Suppresses (doesn't forward) Election/Elected messages from lower-ID initiators
* Updates its cache when a message from a higher-ID initiator arrives
* Only the highest-ID initiator's messages will survive
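
A rough sketch of the suppression rule, assuming processes `0..n-1` on a ring, no failures, and FIFO delivery modelled with a single queue. The function name `surviving_run` and its return shape are hypothetical; the sketch only tracks which initiator's run survives, not the attribute comparison inside that run.

```python
from collections import deque


def surviving_run(n, initiators):
    """Return (initiator whose run completes, number of suppressed messages)."""
    best_seen = [-1] * n             # per-process cache of highest initiator ID seen
    queue = deque()
    for p in initiators:
        best_seen[p] = p             # an initiator has seen its own run
        queue.append(((p + 1) % n, p))   # (receiver, initiator_id)

    survivor = None
    suppressed = 0
    while queue:
        receiver, init_id = queue.popleft()
        if init_id == receiver:
            survivor = receiver      # message made it all the way around
        elif init_id < best_seen[receiver]:
            suppressed += 1          # lower-ID initiator: don't forward
        else:
            best_seen[receiver] = init_id              # update the cache
            queue.append(((receiver + 1) % n, init_id))  # forward
    return survivor, suppressed


print(surviving_run(5, [1, 3]))  # (3, 1)
```

Process 3 drops the message started by process 1, so only the run started by the highest-ID initiator completes a full lap.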

### Failures
* What if the One True Leader crashes before receiving its own Election message? Then, no one stops forwarding the Election message around the ring

#### Predecessor Detection
* The predecessor/successor can detect the failure and re-initiate the run
    * Can time-out waiting between Election & Elected
    * Or if it gets an (Elected:80) message, but knows 80 has failed
* __Trouble__
    * What if predecessor(s) keep failing?
    
#### Dedicated Failure Detector
* Any process can detect failure of would-be leader via its local failure detector, and thus re-initiate run
* __Trouble__
    * Incompleteness
        * Would-be leader's failure might be missed
    * Inaccuracy
        * False positive (would-be leader mistakenly detected as failed)
            * New election will run forever

## Consensus & Leader Election
If we could solve Leader Election, we could solve Consensus
* Just use the last bit of the elected process's ID as a 1/0 flag
    * If everyone elects the same leader, then everyone has also decided on the same 1/0 value!
* Consensus cannot be guaranteed in an asynchronous system, so neither can Leader Election
* But Paxos can still be implemented to help.
* Google's Chubby & Apache's Zookeeper implement Paxos-like systems under-the-hood
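
The reduction from Consensus to Leader Election can be illustrated in a few lines. This is only a sketch: `decide_bit` and the `elected_by` map are hypothetical names, and the leader ID 42 is an invented example outcome.

```python
def decide_bit(leader_id: int) -> int:
    """Derive a 0/1 decision from the last bit of the elected leader's ID."""
    return leader_id & 1


# Hypothetical election outcome: every process elected the same leader.
elected_by = {"p1": 42, "p2": 42, "p3": 42}

# Each process applies the rule locally, with no extra communication.
decisions = {p: decide_bit(leader) for p, leader in elected_by.items()}
print(decisions)  # {'p1': 0, 'p2': 0, 'p3': 0}
```

Agreement on the leader carries over directly to agreement on the bit, which is the whole argument.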

## Google Chubby
__A system for locking__

### System Model
* Group of replicas
* One master amongst them at all times



### Algo

__Concept__

The leader gets a quorum of votes. The leader has a "term", after which it can be re-voted into office.

* Potential leader tries to get votes
* Each server votes for at most one leader
* Server with quorum becomes leader
* __Safe__: any two quorums (majorities) share at least one server, and each server votes at most once, so you can never have two leaders elected
* Concept of __leases__
    * After election, master has an election "term" where no other elections will be run
    * Can "renew" a lease by asking if it still has a majority
        * Avoids overhead of full election
    * Also guarantees a new election _will_ be run at end of lease
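
A minimal sketch of the quorum vote and the lease, under stated assumptions: this is not Chubby's API, all names (`quorum_size`, `elect`, `Lease`) are hypothetical, and time is passed in explicitly as a number instead of being read from a clock.

```python
from collections import Counter


def quorum_size(n_replicas):
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1


def elect(votes, n_replicas):
    """votes maps server -> candidate; a dict key appears once, so each
    server votes for at most one candidate. Returns the winner or None."""
    for candidate, count in Counter(votes.values()).items():
        if count >= quorum_size(n_replicas):
            return candidate
    return None


class Lease:
    """Master's term: no other election is run until it expires."""

    def __init__(self, holder, granted_at, duration):
        self.holder = holder
        self.duration = duration
        self.expires_at = granted_at + duration

    def is_valid(self, now):
        return now < self.expires_at

    def renew(self, acks, n_replicas, now):
        """Extend the lease only if it is still valid and a majority acks."""
        if self.is_valid(now) and acks >= quorum_size(n_replicas):
            self.expires_at = now + self.duration
            return True
        return False


master = elect({"a": "x", "b": "x", "c": "y", "d": "x", "e": "y"}, n_replicas=5)
lease = Lease(master, granted_at=0, duration=10)
print(master, lease.renew(acks=3, n_replicas=5, now=8), lease.expires_at)  # x True 18
```

Because two candidates cannot both collect `quorum_size(n)` votes when each server votes once, `elect` returns at most one winner. Renewal skips the vote entirely and only checks that a majority still acknowledges the master.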

## Apache Zookeeper
Uses a Paxos variant called _Zab_: **Z**ookeeper **A**tomic **B**roadcast

### Algo
__Concept__

All servers throw their bid into a central location. The highest-bid server becomes leader.

1. Each server creates a _sequence number_, which we'll refer to as IDs
1. Writes this ID into the Zookeeper file system _atomically_
1. Highest ID server becomes the new leader
* Handling ID conflicts & leader failures during election:
    1. Potential Leader broadcasts NEW_LEADER
    1. Each process ACKs at most one NEW_LEADER
    1. A Leader that receives a quorum of ACKs broadcasts COMMIT
        * __Potentially, no Leader gets a quorum!__
    1. On receipt of COMMIT, each process updates its Leader pointer

1. __Failure Detection__
    1. Each process monitors next-highest ID process (successor)
    1. On detection of failure in successor
        1. If successor was the Leader
            1. You become the leader
        1. Else, 
            1. Wait until timeout
            1. Re-check successor

## Bully Algorithm

### Algo
__Concept__

All processes know each other, and have a ranking.
As leaders fail, the next-in-line process assumes Leadership, or a lower-ranked process asks higher-ranked processes until one assumes Leadership.

* All processes know each other's IDs (ranking)
* On Leader failure
    * If you're next-in-line,
        1. Assume Leadership
        1. Broadcast COORDINATOR message
    * Else, you're not next-in-line: 
        1. Send ELECTION to all higher-ranked processes
        1. If no one responds within _timeout_:
            * Assume Leadership (broadcast COORDINATOR)
        1. Else, someone responded:
            * Wait for COORDINATOR message
            * If _timeout_ waiting:
                * Re-send ELECTION messages
* On ELECTION receipt:
    1. Reply OK to the sender
    1. Send ELECTION messages to higher-ranked processes
        * If you receive OK:
            * Wait for COORDINATOR (wait for new LEADER)
        * Else, higher-ranked process(es) never responded
            * Broadcast COORDINATOR (become the LEADER)
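
The message flow above can be sketched as a small simulation that counts messages. It is a simplified sketch under assumptions: no process fails during the election, timeouts are modelled as "no live process answered", the next-in-line shortcut is omitted (every process sends ELECTION to all higher IDs, including the failed leader), and `bully_election` and its parameters are hypothetical names.

```python
def bully_election(ids, alive, starter):
    """Run a Bully election started by `starter`.

    ids: all known process IDs (higher ID = higher rank).
    alive: the subset of ids that is currently up.
    Returns (new_leader, message_counts_by_type).
    """
    counts = {"ELECTION": 0, "OK": 0, "COORDINATOR": 0}
    started = set()          # processes that have already run their election
    pending = [starter]
    leader = None
    while pending:
        p = pending.pop()
        if p in started:
            continue
        started.add(p)
        higher = [q for q in ids if q > p]
        counts["ELECTION"] += len(higher)      # ELECTION to every higher ID
        responders = [q for q in higher if q in alive]
        counts["OK"] += len(responders)        # each live one answers OK
        if responders:
            pending.extend(responders)         # they start their own elections
        else:
            leader = p                         # nobody answered: take over
            counts["COORDINATOR"] += sum(1 for q in ids if q < p)
    return leader, counts


leader, counts = bully_election(ids=[1, 2, 3, 4, 5], alive={1, 2, 3, 4}, starter=1)
print(leader, counts["ELECTION"])  # 4 10
```

With five processes and the lowest ID starting the election, the ELECTION count is 4 + 3 + 2 + 1 = 10, which is the quadratic worst case for message count.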
            
### Run Time

* __Worst Case__
    * __Assumptions__
        * No failures during election
        * Lowest-ID process detects LEADER failure
    1. Lowest-ID broadcasts ELECTION to all higher-ID processes; each recipient then starts its own election, so the number of ELECTION messages is quadratic:
        $$
        (N-1) + (N-2) + \dots + 1 = \frac{N(N-1)}{2} = O(N^{2})
        $$
    1. Would-be LEADER responds OK to lowest-ID process
    1. ELECTION from would-be LEADER to failed LEADER
    1. Would-be LEADER times out waiting for OK from failed LEADER
    1. Would-be LEADER broadcasts COORDINATOR (assumes Leadership)
**Total** = 5 message transmission times
* __Best Case__
    * Second-highest ID process detects LEADER failure
    * Sends N-2 COORDINATOR messages
**Total** = 1 message transmission time


* Since _timeouts_ are built into the algorithm, and an asynchronous system can have indefinitely long delays, a process may always time out, so liveness is not guaranteed.

## Themes to Leader Election
* Processes must be ranked
* Reliant on majorities