Distributed System Design Workshop

SRE Classroom: How to Design a Distributed System in Three Hours

SREcon 2019

Outline of workshop

Basics of Non-Abstract Large System Design

Requirements and Scaling
- Identify SLIs and SLOs
- Example SLO: 99% of queries return a valid result within 100 ms
- A Service Level Agreement (SLA) is the contract that contains all relevant SLOs and the penalties if they are violated
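
The example SLO above can be checked mechanically. The following is a minimal sketch, assuming each request is recorded as a `(latency_ms, valid)` pair; the function names and the record shape are illustrative and do not come from the workshop.

```python
def good_fraction(requests, threshold_ms=100.0):
    """SLI: fraction of requests that returned a valid result within threshold_ms.

    `requests` is a list of (latency_ms, valid) tuples (an assumed record shape).
    """
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for latency_ms, valid in requests
               if valid and latency_ms <= threshold_ms)
    return good / len(requests)


def meets_slo(requests, target=0.99, threshold_ms=100.0):
    """SLO: the SLI must be at least `target` (0.99 means 99% of queries)."""
    return good_fraction(requests, threshold_ms) >= target
```

Counting slow or invalid responses as "bad" keeps the SLI a single ratio, which is easy to compare against the target.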
Scaling via Microservices
- Replace a monolith with distinct microservices
- If you put a load balancer in front of your microservices, it is easier to scale horizontally
- Other types of scaling:
  - geographical (more physical locations)
  - functional (different feature sets)
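
To illustrate why a load balancer makes horizontal scaling easier, here is a minimal round-robin sketch. The class and method names are hypothetical, and real load balancers also handle health checks and weighting, which are left out here.

```python
class RoundRobinBalancer:
    """Hands out backends in rotation; adding a backend adds capacity."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._next = 0  # index of the next request, taken modulo the pool size

    def add_backend(self, backend):
        """Scale horizontally: new backends start receiving traffic at once."""
        self._backends.append(backend)

    def pick(self):
        if not self._backends:
            raise RuntimeError("no backends available")
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return backend
```

Clients only ever talk to the balancer, so the pool can grow or shrink without any client-side change.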
Dealing with Loss and Failure
- Failure is not an option, it's a certainty
- Cloud providers offer ready-made solutions that handle some failures for you
- Decouple: spread responsibilities across multiple processes
- Avoid global changes: use a multi-tiered canary
- Spread risk: don't depend on one backend
- Degrade gracefully: keep serving if configs are corrupt or fail to push
- Achieving reliability: run n + 2 instances, geographically distributed
  - n = number of instances needed to handle standard load
  - Why +2?
    - Planned maintenance
    - Unplanned outages (failures)
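
The n + 2 rule can be worked through as arithmetic. This is a sketch under assumed numbers: the peak load and per-instance capacity in the usage note are made up for illustration.

```python
import math


def instances_to_deploy(peak_qps, qps_per_instance, spare=2):
    """Return n + spare, where n is the instance count needed for peak load.

    spare=2 covers one instance down for planned maintenance and one
    unplanned failure happening at the same time.
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    n = math.ceil(peak_qps / qps_per_instance)
    return n + spare
```

For example, with an assumed peak of 10,000 QPS and 1,500 QPS per instance, n is 7, so 9 instances are deployed; with any two of them down, the remaining 7 still provide 10,500 QPS.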
Keeping State and Data
- Useful for consistency, performance and reliability
- Regardless of the amount of data, it all comes down to global consensus
- Find authoritative instances of other services (leader election)
  - Example consensus algorithm: Raft
- Always prefer stateless. It's easier.
- CAP: consistency, availability and partition tolerance
- Networks are not reliable, but partitions are rare
- Hot data and hotspotting
  - Some data is accessed more frequently than others
  - Frequent access to the same data can cause servers to overload
  - Capacity cache vs. performance cache: a capacity cache, which the service needs in order to sustain its load, can be dangerous
  - A cache should not form part of the requests-per-second capacity guarantee for a service
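
The difference between a capacity cache and a performance cache can be shown with a small calculation. This is a sketch, assuming a uniform cache hit ratio; the function names and all numbers in the usage note are illustrative.

```python
def backend_qps(total_qps, cache_hit_ratio):
    """Load that reaches the backend when the cache absorbs the hits."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_qps * (1.0 - cache_hit_ratio)


def is_capacity_cache(total_qps, backend_capacity_qps):
    """True if the backend cannot carry the full load once the cache is gone.

    In that case the cache is part of the capacity guarantee, which is
    the dangerous situation described in the notes above.
    """
    return backend_capacity_qps < total_qps
```

With an assumed 10,000 QPS and a 90% hit ratio, the backend sees about 1,000 QPS. If the backend can only serve 2,000 QPS, a cold or failed cache exposes it to five times its capacity, so the cache is a capacity cache rather than a performance cache.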
Non-abstract design
- For each microservice, consider:
  - Disk I/O
  - QPS
  - Network bandwidth
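
A back-of-the-envelope estimate for one microservice might look like the sketch below. The `Workload` fields, the helper names, and the hardware figures in the usage note are assumptions for illustration, not values from the workshop.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    qps: float                     # requests per second at peak
    response_bytes: int            # average response size in bytes
    disk_reads_per_request: float  # average disk reads needed per request


def network_bits_per_second(w):
    """Egress bandwidth needed to send the responses."""
    return w.qps * w.response_bytes * 8


def disk_iops(w):
    """Disk read operations per second."""
    return w.qps * w.disk_reads_per_request


def units_needed(demand, capacity_per_unit):
    """Machines (or disks, or NICs) needed to cover the demand."""
    return math.ceil(demand / capacity_per_unit)
```

For an assumed workload of 5,000 QPS, 20 kB responses and 2 disk reads per request, the service needs 0.8 Gbit/s of egress and 10,000 IOPS. Assuming 1 Gbit/s per NIC and 200 IOPS per spinning disk, that is 1 NIC but 50 disks, so disk I/O is the dimension that drives the design.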
