# System Designs Overview


## Web Service System for Millions of Users
![Final Million User System](Resources/FinalWebSrvr.JPG)

(One Data Center Shown as Example)
* **Load Balancer**
    * Sends traffic to the correct data center based on GeoDNS routing
    * Distributes traffic evenly to web servers within a data center
    * Routes traffic only to healthy web servers or data centers
* **CDN - Content Delivery Network**
    * Geographically closest server serves content
    * Can cache and deliver based upon request path, cookies, query strings, and headers
* **Stateless Web Server Architecture**
    * Session state data stored in a persistent data store
    * Allows for scaling the web tier
    * Session state store is shared across all data centers so traffic can be re-routed during an outage
* **Data Tier follows a Master-Slave Replication design with Sharding**
    * Master database receives all writes, updates, and deletes
    * All data is copied to slaves which receive all read requests
    * Provides high availability and reliability
    * Data is sharded across multiple Master-Slave constellations
* **Message Queue Follows Pub-Sub Pattern for Task Processing**
* **Logging and Metrics Performed**
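
The load balancer behaviour described above (even distribution, healthy servers only) can be sketched as a round-robin picker. This is a minimal illustration, not the design's actual balancer: the class name `RoundRobinBalancer`, its methods, and the server names are all hypothetical, and health is assumed to be reported by some external health check.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Cycles through servers in order, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
balancer.mark_down("web-2")
picks = [balancer.next_server() for _ in range(4)]
print(picks)  # ['web-1', 'web-3', 'web-1', 'web-3']
```

Round-robin is only one possible policy; weighted or least-connections strategies would fit the same interface.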

## Rate Limiter Design

![Rate Limiter](Resources/RateLimiter.JPG)

* **Rate Limiter Middleware**
    * Assuming microservice design, could also use rate limiter functionality in API Gateway
    * Uses Leaking Bucket rate limiting Algorithm (See Below)
        * Requests that are rate limited receive an HTTP 429 (Too Many Requests) response with appropriate headers
        * e.g. *X-Ratelimit-Remaining*, *X-Ratelimit-Limit*, *X-Ratelimit-Retry-After*
* **Counts stored in Cache**
* **Rules/Configuration for Rate Limiting Stored on Disk**
    * Workers pull from disk to update rules and store in/update cache
    * Allows for starting new rate limiters
* **Crucial to Have Monitoring of Rate Limiting Metrics** 
* **Multi-DataCenter or Distributed Environment Challenges**
    * Race conditions when updating counters $\rightarrow$ use a sorted set data structure
    * Synchronization of data between rate limiters $\rightarrow$ potentially use centralized data store and/or adopt eventual consistency
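
The middleware's leaking bucket and its 429 response can be sketched as follows. This is a minimal single-process illustration: the class and function names are hypothetical, the queue lives in memory rather than in a shared cache, and mapping `X-Ratelimit-Limit` to the queue size is an assumption made for the example.

```python
from collections import deque


class LeakingBucket:
    """FIFO queue of fixed size that is drained at a fixed rate."""

    def __init__(self, queue_size, outflow_rate):
        self.queue_size = queue_size      # S: max queued requests
        self.outflow_rate = outflow_rate  # R: requests released per time unit
        self.queue = deque()

    def offer(self, request):
        """Queue the request if there is room; return False if it is dropped."""
        if len(self.queue) >= self.queue_size:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        """Call once per time unit: release up to R requests in FIFO order."""
        count = min(self.outflow_rate, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]


def rejection(bucket, retry_after_seconds):
    """Status code and headers for a rate-limited request."""
    headers = {
        "X-Ratelimit-Limit": str(bucket.queue_size),  # assumed mapping
        "X-Ratelimit-Remaining": str(bucket.queue_size - len(bucket.queue)),
        "X-Ratelimit-Retry-After": str(retry_after_seconds),
    }
    return 429, headers


bucket = LeakingBucket(queue_size=2, outflow_rate=1)
results = [bucket.offer(f"req-{i}") for i in range(3)]  # third one is dropped
status, headers = rejection(bucket, retry_after_seconds=1)
processed = bucket.leak()  # one request leaves the queue per time unit
```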
    
### Rate Limiting Algorithms
* **Token Bucket**
    * Parameters: bucket capacity of *N* tokens, with tokens added to the bucket at rate *R* (tokens per unit of time)
    * Each request needs a token for processing
    * If no tokens available, request is dropped
    * Can either have a global bucket or a bucket for each API endpoint
* **Leaking Bucket**
    * Parameters: FIFO queue size *S* and outflow rate *R* (number of requests to process in a fixed time unit)
    * Requests are put on the queue if there is room and processed at a fixed rate
    * Otherwise request is dropped
* **Fixed Window Counter**
    * Parameters: time window size *S* and max counter of *C*
    * Timeline is divided into windows of size *S* and each window can process up to *C* requests, with each request incrementing a counter that is initialized at 0
    * When counter is full for a window, requests are dropped
    * Issue: A burst straddling the boundary between two windows can let through more than *C* requests (up to 2*C*) within a span of length *S*
* **Sliding Window Log**
    * Parameters: max request count *C* per unit of time window *T* (e.g. 60 seconds lookback)
    * Request timestamp stored in a cache and all requests older than current timestamp minus time window *T* are removed
    * If log/cache size is less than max request count *C*, request is processed
    * Else, the request is dropped but kept in the cache/log
* **Sliding Window Counter**
    * Hybrid between Fixed Window Counter and Sliding Window Log algorithms
        * Fixed window length *FW*
        * Rolling window length *RW*
        * Max requests *C*
        * Number of requests for a rolling window calculated using:  
        `REQUESTS_IN_CURR_WINDOW + REQUESTS_IN_PREV_WINDOW * OVERLAP_PERCENT_OF_ROLLING_WINDOW_AND_PREV_WINDOW`
        * Example: If 3 requests in current window, 7 in previous window, and the rolling window has a 70% overlap with previous window $\rightarrow 3 + 7 \times 0.70 = 7.9$
    * Can round the calculated number of requests in a rolling window up or down
    * Reject the request if that number is greater than *C*; otherwise accept it
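
Two of the algorithms above, Token Bucket and Sliding Window Counter, can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions: the names `TokenBucket`, `allow`, and `sliding_window_estimate` are hypothetical, time is passed in explicitly so the behaviour is reproducible, and a production limiter would keep its state in a shared cache.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity          # N in the notes
        self.refill_rate = refill_rate    # R in the notes
        self.tokens = float(capacity)     # bucket starts full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; return False if the request is dropped."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def sliding_window_estimate(current_count, previous_count, overlap_fraction):
    """Estimated requests in the rolling window, per the formula above."""
    return current_count + previous_count * overlap_fraction


bucket = TokenBucket(capacity=2, refill_rate=1, now=0.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]  # [True, True, False]
later = bucket.allow(now=1.0)                      # one token refilled -> True

estimate = sliding_window_estimate(4, 10, 0.5)     # 4 + 10 * 0.5 = 9.0
accepted = estimate <= 10                          # compare against max C = 10
```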

## Consistent Hashing Design

![HashRing](Resources/HashRing.JPG)

* Naive hashing approach is: $hash(\text{key}) \% \text{NUM_SERVERS}$
    * Does not account for servers being added or removed: changing the number of servers re-maps most keys
    * Does not offer much flexibility for custom hashing
* Consistent Hashing Solves both
    * Basic Consistent Hashing
        * Create Hash Ring: If hash function $hash\_func$ has output range $[x_0, x_n]$, connecting the two endpoints forms a hash ring
        * Hash servers by IP or name and map them onto the Hash Ring
        * The key of a request is hashed onto the Hash Ring, and the request is assigned to the first server encountered moving clockwise from that position
    * Using Virtual nodes allows for more even key distribution and consistent partition sizes for each server
        * For each server $s_k$ on the hash ring, create $n$ virtual nodes $s_{k,1}, \dots, s_{k,n}$ that each represent the actual server, e.g. $s_{k,1} \rightarrow s_k$
        * These virtual nodes are spread around the hash ring
        * When a request is mapped to the ring, the first virtual node going clockwise determines the server that request is assigned to
* Adding or removing servers with consistent hashing requires only the affected keys to be redistributed
    * Example: With servers $s_0$, $s_1$, and $s_2$, if a new server $s_3$ is added between $s_2$ and $s_0$, only the keys that fall between $s_2$ and $s_3$ (which previously mapped to $s_0$) are re-mapped to $s_3$
    * Removing a server is similar: only the keys between the removed server and the previous server going counter-clockwise are re-mapped, to the next server going clockwise
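
A hash ring with virtual nodes can be sketched as below. This is a minimal illustration rather than a production implementation: the class `HashRing`, the `server#i` naming of virtual nodes, the choice of SHA-256, and the default of 100 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a ring position (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, virtual_nodes=100):
        self.virtual_nodes = virtual_nodes
        self._positions = []  # sorted ring positions of all virtual nodes
        self._owner = {}      # ring position -> real server name

    def add_server(self, server):
        for i in range(self.virtual_nodes):
            pos = _hash(f"{server}#{i}")
            self._owner[pos] = server
            bisect.insort(self._positions, pos)

    def remove_server(self, server):
        for i in range(self.virtual_nodes):
            pos = _hash(f"{server}#{i}")
            if self._owner.get(pos) == server:
                del self._owner[pos]
                self._positions.remove(pos)

    def get_server(self, key):
        if not self._positions:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]


ring = HashRing()
for name in ("s0", "s1", "s2"):
    ring.add_server(name)

keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.get_server(k) for k in keys}
ring.add_server("s3")
after = {k: ring.get_server(k) for k in keys}

# Only keys claimed by the new server change owner; all others stay put.
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `s3`, which is the property that distinguishes consistent hashing from the naive modulo approach.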

## Key-Value Store Design

**Overview**  
![KeyValueDesign](Resources/KeyValueOverview.JPG)

* Clients connect via API for reads and writes through a coordinator node
* Nodes are distributed using consistent hashing and data is replicated
* All nodes have the same set of responsibilities: Client API, Conflict Resolution, Replication, Failure Detection, Failure Repair, Storage Engine, etc.

**Write to Specific Node**   
![KeyValueDesign](Resources/KeyValueWrite.JPG)

* Write is persisted to a commit log file and saved in an in-memory cache
* When the cache reaches a threshold, data is flushed to disk
    * On disk, data is stored as an SSTable (sorted-string table)
    * Sorted list of key-value pairs      
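
The write path above can be sketched as follows. This is a minimal in-memory illustration: the class name `StorageNode` and the `flush_threshold` parameter are hypothetical, and Python lists stand in for the commit log file and the on-disk SSTables.

```python
class StorageNode:
    """Appends to a commit log, buffers in memory, and flushes sorted runs."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []  # stand-in for an append-only log file
        self.memtable = {}    # in-memory cache of recent writes
        self.sstables = []    # each SSTable is a sorted list of (key, value)

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. persist to the commit log
        self.memtable[key] = value            # 2. save in the memory cache
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                     # 3. flush once the threshold is hit

    def _flush(self):
        # Keys are unique within the memtable, so this sorts by key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()


node = StorageNode(flush_threshold=2)
node.put("b", 1)
node.put("a", 2)  # threshold reached: memtable is flushed as a sorted SSTable
node.put("c", 3)  # stays in memory until the next flush
```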

**Read from a Specific Node**   
![KeyValueDesign](Resources/KeyValueRead.JPG)

* First check whether the key is in the in-memory cache and, if so, return its value
* Otherwise, check the Bloom filter to determine which SSTables may contain the key, then read it from those SSTables
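
The read path can be sketched with a small Bloom filter. This is a minimal illustration under stated assumptions: the `BloomFilter` class, its default sizes, the seeded SHA-256 hashing scheme, and the `read` function are hypothetical, and each SSTable is modelled as a plain dict paired with its filter.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _indexes(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, key):
        for idx in self._indexes(key):
            self.bits[idx] = True

    def might_contain(self, key):
        return all(self.bits[idx] for idx in self._indexes(key))


def read(key, memtable, sstables):
    """`sstables` is a list of (bloom_filter, table_dict) pairs, newest first."""
    if key in memtable:                # 1. serve from the memory cache
        return memtable[key]
    for bloom, table in sstables:      # 2. skip SSTables the filter rules out
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None


bloom = BloomFilter()
bloom.add("a")
value = read("a", memtable={}, sstables=[(bloom, {"a": 1})])
```

A Bloom filter can report false positives but never false negatives, which is why the table itself is still checked after the filter says "possibly present".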

Design leverages:   
* CP or AP guarantee, depending on business needs
* Eventual consistency with a Vector Clock for conflict resolution
* The Gossip Protocol with a Sloppy Quorum and Hinted Handoff for node failure handling
* Merkle Tree anti-entropy approach for replica syncing for permanent node failure
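
Vector clock conflict detection, which the design relies on under eventual consistency, can be sketched as follows. This is a minimal illustration, assuming a clock is a plain dict mapping a server ID to its version counter; the function names `increment`, `descends`, and `compare` are hypothetical.

```python
def increment(clock, server):
    """Return a copy of `clock` with `server`'s counter incremented (or set to 1)."""
    updated = dict(clock)
    updated[server] = updated.get(server, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen everything `b` has (every counter >=)."""
    return all(a.get(server, 0) >= count for server, count in b.items())


def compare(a, b):
    """Classify the relationship between two versions of the same item."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a is newer"
    if b_covers_a:
        return "b is newer"
    return "conflict"  # concurrent writes: the client must resolve


v1 = increment({}, "s0")    # {"s0": 1}
v2 = increment(v1, "s1")    # {"s0": 1, "s1": 1}
v3 = increment(v1, "s2")    # {"s0": 1, "s2": 1}, written concurrently with v2

print(compare(v2, v1))      # a is newer
print(compare(v2, v3))      # conflict
```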

## Associated Topics

* **CAP Theorem**
    * For distributed systems, at most 2 of the properties below can be guaranteed at the same time
        * **Consistency** - all clients see the same data at the same time no matter which node they connect to
        * **Availability** - any client which requests data gets a response even if some nodes are down
        * **Partition Tolerance** - the system continues to operate despite a network partition, i.e. a communication break between two nodes
    * Key-Value stores mostly support one of two guarantees
        * **CP** or **AP**
            * **CP**: If a node goes down, writes to the remaining nodes are blocked to maintain consistency, so stale data is never returned
            * **AP**: If a node goes down, writes are still accepted to maintain availability, stale data may be returned, and data is synced when the down node comes back up
        * **CA** is not realistic in the real world, since network failures are unavoidable
* **Data Replication and Partition Using Consistent Hashing**
    * Consistency models define how data is kept in sync across replicas on multiple nodes
        * **Quorum Consensus**
            * *N* is the number of replicas, *W* is the write quorum size, and *R* is the read quorum size
            * For a write to be successful, it must be acknowledged by *W* replicas
            * For a read to be successful, responses must be received from *R* replicas
            * *W* + *R* > *N* ensures strong consistency, because every read quorum overlaps every write quorum in at least one replica
        * **Strong Consistency** - A client never sees out-of-date data
        * **Weak Consistency** - Subsequent reads may not see most up-to-date data
        * **Eventual Consistency** - Given enough time, all updates are propagated and all replicas become consistent
    * Client Inconsistency Resolution
        * For Eventual Consistency, concurrent writes may cause inconsistent values to be returned, and the client must resolve them
        * **Versioning** - treat each data modification as a new immutable version of data
        * **Vector Clock** 
            * [*server*, *version*] pair associated with data
            * For writes/updates, server $s_i$ increments $v_i$ if an entry $[s_i, v_i]$ exists, or otherwise creates a new entry $[s_i, 1]$
            * Two versions conflict when neither clock is less than or equal to the other across all counters, i.e. each version has at least one counter greater than the other's; clients detect and resolve such conflicts
            * Can set a parameter that keeps only the *n* most recent pairs
* **Failure Detection and Handling**
    * Failure Detection
        * **All-to-All Multicasting**
            * Every server is connected to every other server
            * When two servers detect that a server is down, it is marked as down
            * Inefficient when there are many servers
        * **Gossip Protocol**
            * Each  node maintains a node membership list of node IDs and heartbeat counters
            * Each node periodically increments its heartbeat counter and sends count to a set of random nodes, which in turn propagate to another set of nodes
            * Membership list updated when heartbeats are received
            * If a member's heartbeat has not been updated for a certain amount of time, the member is marked as offline after this is verified by a certain number of nodes, e.g. 2
    * Handling Temporary Failures - After node is detected and marked down
        * **Strict Quorum** - read and write operations potentially blocked
        * **Sloppy Quorum**
            * Improves availability by not strictly enforcing the quorum requirement
            * Chooses the first *W* healthy servers for writes and the first *R* healthy servers for reads on the hash ring
        * **Hinted Handoff**
            * Another server temporarily handles down server's traffic
            * When the down server comes back up, the changes are pushed back to it to achieve data consistency
    * Handling Permanent Failures - Node cannot be recovered
        * Employ anti-entropy protocol to keep replicas in sync by updating each replica to the newest version through comparison
        * **Merkle Tree** does this efficiently
            * A hash tree where every non-leaf node is labeled with a hash of the labels or values of its child nodes (values are in the case where the child nodes are leaves)
            * Process:
                * Divide key space into *N* buckets
                * Hash each key in a bucket and then hash all hashes of the keys to create a hash for the bucket
                * Build tree upward to root by calculating hashes of children
            * Replicas are compared first at the root; if the root hashes differ, traverse down the branches whose hashes differ until the mismatched buckets are found, then sync only those buckets
                

