Distributed System Design Workshop

kimschles edited this page Mar 26, 2019 · 1 revision

SRE Classroom: How to Design a Distributed System in Three Hours

SRECon 2019

Basics of Non-Abstract Large System Design

  • Requirements and Scaling
    • Identify Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
    • Example SLO: the 99th percentile of queries returns a valid result within 100 ms
    • A Service Level Agreement (SLA) is the contract that contains all relevant SLOs and the penalties if they are violated
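
The example SLO above can be checked against measured latencies. This is a minimal sketch, assuming latency samples are already collected in milliseconds and that every sample corresponds to a valid result; the function names `percentile` and `meets_latency_slo` are hypothetical and not part of the workshop material.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: rank is 1-based, so subtract 1 to index the list.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms, pct=99, threshold_ms=100):
    """Return True if the pct-th percentile latency is within the threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms


# 99 fast queries and one slow outlier: the 99th percentile is still 10 ms.
mostly_fast = [10] * 99 + [500]
# Two slow queries out of 100: the 99th percentile is now 500 ms.
too_many_slow = [10] * 98 + [500, 500]
```

With these samples, `meets_latency_slo(mostly_fast)` holds and `meets_latency_slo(too_many_slow)` does not, which shows how a percentile-based SLO tolerates a small tail of slow queries.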
  • Scaling via Microservices
    • Replace a monolith with distinct microservices
    • If you put a load balancer in front of your microservices, it is easier to scale horizontally
    • Other types of scaling:
      • geographical (more physical locations)
      • functional (different feature sets)
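
To illustrate why a load balancer makes horizontal scaling easier, here is a minimal round-robin sketch. The class `RoundRobinBalancer` and its methods are hypothetical names chosen for illustration; real load balancers also handle health checks, weighting, and connection draining, none of which is modeled here.

```python
class RoundRobinBalancer:
    """Hands out backends in rotation; new backends can be added at any time."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._next = 0  # count of requests routed so far

    def add_backend(self, backend):
        """Scale horizontally: clients keep calling pick() and never notice."""
        self._backends.append(backend)

    def pick(self):
        """Return the backend that should serve the next request."""
        if not self._backends:
            raise RuntimeError("no backends available")
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return backend


lb = RoundRobinBalancer(["replica-a", "replica-b"])
first_three = [lb.pick() for _ in range(3)]
lb.add_backend("replica-c")  # add capacity without changing any caller
next_three = [lb.pick() for _ in range(3)]
```

Callers only ever talk to `pick()`, so adding `replica-c` changes the capacity of the service without changing any client.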
  • Dealing with Loss and Failure
    • Failure is not an option, it's a certainty
    • Cloud providers offer some ready-made solutions; they handle some failures
    • Decouple: spread responsibilities across multiple processes
    • Avoid global changes: use a multi-tiered canary
    • Spread risk: don't depend on one backend
    • Degrade gracefully: keep serving if configs are corrupt or fail to push
    • Achieving reliability: run n + 2 geographically distributed instances
      • n = the number of instances needed to handle standard load
      • Why +2?
        • Planned maintenance
        • Unplanned outages
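
The n + 2 rule above reduces to simple arithmetic. The sketch below assumes you know the load to be served and the sustainable throughput of a single instance; the function name `instances_needed` and the example numbers are illustrative assumptions, not figures from the workshop.

```python
import math


def instances_needed(load_qps, qps_per_instance, spare=2):
    """Return n + spare, where n is the instance count that covers the load.

    spare=2 leaves one instance free for planned maintenance and one
    for an unplanned failure happening at the same time.
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    n = math.ceil(load_qps / qps_per_instance)
    return n + spare


# Assumed numbers: 10,000 QPS of load, 1,500 QPS per instance.
# n = ceil(10000 / 1500) = 7, so the deployment runs 9 instances.
total = instances_needed(10_000, 1_500)
```

Note that the relative cost of the two spares shrinks as n grows: two extra instances on top of 7 is far cheaper, proportionally, than two on top of 1.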
  • Keeping State and Data
    • Useful for consistency, performance and reliability
    • Regardless of the amount of data, it all comes down to global consensus
    • Find authoritative instances of other services (leader election)
    • Always prefer stateless. It's easier.
    • CAP: consistency, availability and partition tolerance
    • Networks are not reliable, but partitions are rare
    • Hot data and hotspotting
      • Some data is accessed more frequently than others
      • Frequent access to the same data can cause servers to overload
      • Capacity cache vs. performance cache: a capacity cache, which the service needs in order to handle its load, can be dangerous
      • A cache should not form part of the requests-per-second capacity guarantee for a service
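
One way to tell the two kinds of cache apart is to ask whether the backend survives if the cache is empty or down. The following sketch assumes a single backend with a known capacity; the function names and numbers are hypothetical.

```python
def backend_qps(total_qps, cache_hit_ratio):
    """Load that reaches the backend after the cache absorbs its share."""
    if not 0 <= cache_hit_ratio <= 1:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_qps * (1 - cache_hit_ratio)


def is_capacity_cache(total_qps, backend_capacity_qps):
    """True if the backend cannot serve the full load without the cache.

    In that case a cold or failed cache overloads the backend, so the
    cache has quietly become part of the capacity guarantee.
    """
    return total_qps > backend_capacity_qps


# Assumed numbers: 1,000 QPS total with a 75% hit ratio.
warm_load = backend_qps(1_000, 0.75)       # load on the backend with a warm cache
risky = is_capacity_cache(1_000, 400)      # backend handles only 400 QPS alone
safe = is_capacity_cache(1_000, 1_200)     # backend handles the full load alone
```

In the `risky` case the backend looks healthy while the cache is warm (250 QPS against a 400 QPS limit), which is exactly why a capacity cache is easy to create by accident.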
  • Non-abstract design
    • For each microservice, consider:
      • Disk I/O
      • QPS
      • Network bandwidth
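
The per-microservice considerations above lend themselves to a back-of-the-envelope calculation. This is a rough sketch under stated assumptions: the inputs (requests per day, response size, disk reads per request) and the peak factor of 2 are made-up example values, and the names `Estimate` and `estimate` are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    peak_qps: float
    network_mbps: float  # megabits per second of response traffic
    disk_iops: float     # disk read operations per second


def estimate(requests_per_day, response_bytes, disk_reads_per_request, peak_factor=2.0):
    """Estimate peak QPS, network bandwidth, and disk I/O for one microservice."""
    avg_qps = requests_per_day / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor  # assume peak traffic is a multiple of the average
    network_mbps = peak_qps * response_bytes * 8 / 1_000_000  # bytes to megabits
    disk_iops = peak_qps * disk_reads_per_request
    return Estimate(peak_qps, network_mbps, disk_iops)


# Assumed workload: 86.4 million requests/day, 10 kB responses, 2 disk reads each.
result = estimate(86_400_000, 10_000, 2)
```

For this assumed workload the estimate comes to 2,000 peak QPS, 160 Mbit/s of response traffic, and 4,000 disk reads per second. Comparing each figure against what one machine can sustain shows which resource sets the instance count.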

Problem Statement