
SREcon19 Day 1


Day 1: Comprehension, Understandability and Predictability

What Breaks our Systems: A Taxonomy of Black Swans by Laura Nolan of Slack

  • A black swan is an outlier event: it's hard to predict and has a severe impact
  • The term comes from Nassim Taleb
  • White swans are 'easy' to resolve
  • Every black swan is unique, but there are patterns
    • Hitting limits
    • Spreading slowness
    • Thundering herds
    • Automation interactions
    • Cyber attacks
    • Dependency problems

Hitting Limits

  • Physical system limits
  • Defense: load and capacity testing
    • Include cloud services (let your provider know)
    • Include write loads
      • Use a replica of prod
      • Go beyond your current size (a load-ramp sketch follows this list)
  • Defense: monitoring
    • When an alert fires, include instructions (a playbook) for how to fix it
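
A minimal sketch of a write-load ramp against a prod replica (the endpoint and payload are hypothetical, and a real load test would usually use a dedicated tool such as Locust or JMeter):

```python
import concurrent.futures

import requests

# Hypothetical write endpoint on a replica of prod; ramp the write rate
# past today's peak to find the real limit before your users do.
TARGET = "https://prod-replica.example.com/api/items"

def write_once(i):
    try:
        resp = requests.post(TARGET, json={"id": i, "payload": "x" * 1024}, timeout=5)
        return resp.status_code
    except requests.RequestException:
        return 599  # count timeouts/connection errors as failures

def ramp(levels=(100, 500, 1000, 5000)):
    for n in levels:
        with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
            codes = list(pool.map(write_once, range(n)))
        errors = sum(c >= 500 for c in codes)
        print(f"{n} writes -> {errors} errors")

if __name__ == "__main__":
    ramp()
```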

Spreading Slowness

  • Defense: fail fast
    • Enforce deadlines for all requests - in and out
    • Consider the circuit breaker pattern (sketched after this list)
      • Limit retries from a client
  • Defense: Dashboards
    • Problem: some resource is saturated
    • Track utilization, saturation, and errors (the USE method) on a dashboard
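
A minimal sketch of fail-fast deadlines plus a circuit breaker (the backend URL is hypothetical; production code would normally reach for a library such as pybreaker rather than hand-rolling this):

```python
import time

import requests

class CircuitBreaker:
    """Stop calling a failing dependency and fail fast until a cool-off passes."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()

def fetch():
    # A deadline on every outbound request keeps slowness from spreading.
    return requests.get("https://backend.example.com/data", timeout=2.0)

# response = breaker.call(fetch)
```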

Thundering Herds

  • Can be users
  • More often from systems: cron jobs, mobile clients updating at once, large batch jobs
  • Defense: plan and test
    • Any internet-facing service can face a thundering herd
    • Plan for degraded modes: which requests can be dropped? Queue input that can be processed asynchronously (see the jitter sketch below)
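
One cheap defense, sketched here with hypothetical names: jitter each client's schedule so synchronized work (cron jobs, mobile update checks) spreads across a window instead of landing at once:

```python
import random
import time

UPDATE_WINDOW_SECONDS = 3600  # clients check in sometime within the hour

def scheduled_update(do_update):
    # Jitter: a random delay turns a synchronized spike into a flat trickle.
    time.sleep(random.uniform(0, UPDATE_WINDOW_SECONDS))
    do_update()
```

The same idea applies server-side: enqueue bursty input and drain the queue at a rate the backend can sustain.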

Automation Interactions

  • Defense: control
    • Create constraints to limit automation operations
    • Provide ways to disable automation
    • All automation should log to one searchable place (a constrained-automation sketch follows this list)
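
A sketch of all three controls in one place (the file path, limit, and log destination are made up for illustration):

```python
import logging
import os

logging.basicConfig(filename="automation.log", level=logging.INFO)
log = logging.getLogger("remediation")

MAX_HOSTS_PER_RUN = 5                      # constraint: cap the blast radius per run
KILL_SWITCH = "/etc/automation/disabled"   # touch this file to disable automation

def remediate(hosts):
    if os.path.exists(KILL_SWITCH):
        log.warning("kill switch present; doing nothing")
        return
    if len(hosts) > MAX_HOSTS_PER_RUN:
        log.error("refusing to act on %d hosts (limit %d)", len(hosts), MAX_HOSTS_PER_RUN)
        return
    for host in hosts:
        log.info("restarting service on %s", host)  # every action is logged
        # ... actual remediation would go here ...
```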

Cyberattacks

  • Defense: smaller blast radius
    • Separate prod from non-prod as much as possible
    • Split prod systems into multiple zones

Dependency Problems

  • Defense: layer and test
    • Layer your infrastructure
    • Regularly test the process of starting your infrastructure up
    • Beware of soft dependencies: they can easily become hard dependencies

Further Reading

  • Release It! by Michael T. Nygard

Complexity: The Crucial Ingredient in Your Kitchen by Casey Rosenthal of Verica.io

Question: how do we make systems reliable?

Challenger Case Study

  • Functionality has redundancy
  • Deviation is experience based (works on my machine)
  • Issue is self-limiting

Avoiding Risk

  • Exposure to risk is how we learn to deal with it, so don't avoid risk

Simplicity

  • Accidental complexity: it is added gradually over time
  • Essential complexity: it is there on purpose

Economic Pillars of Complexity

  • States (adding features)
  • Relationships (microservices and k8s increase the number of relationships)
  • Environment (cloud provider or on prem?)
  • Reversibility (build features in chunks so you can rollback)

Software Engineering: the Bureaucratic Profession

  • Our industry separates who decides what will be done from who builds it
  • Think of a well-run kitchen: lots of tasks are …

tl;dr

  • Embrace complexity and navigate it
  • Provide opportunities for teams to practice working together
  • Tolerate inefficiencies

Case Study: Implementing SLOs for a New Service by Arnaud Lawson of SquareSpace

Definitions

  • Ceph Object Storage (COS)
    • S3-compatible
    • Geo-distributed
  • SLOs and SLIs
    • Service level objectives (SLOs) set performance and reliability targets for a service, as seen by its users, over a period of time
    • Service level indicators (SLIs) are the measurements used to track those targets
    • Example SLO: 99.9% of API requests will not fail over n weeks
    • Example SLI: the percentage of API requests that do not fail

SLO implementation process

  1. Determine SLI types that best capture our users' experience
    • Understand how users interact with COS
    • Understand COS components and choose SLI types that best reflect users' experience
      • COS presents a request-driven RESTful interface
  2. Define SLIs, the thing to measure
    • For the request-driven HTTP server
      • Availability SLI: % of requests that do not fail
      • Latency SLI: % of requests that complete in less than x seconds
  3. Choose how to measure these SLIs
    • Collect SLIs from COS load balancer logs
    • Instrument COS S3 client programs
    • Deploy probers that perform common user actions
  4. Collect SLIs for a few weeks to get a baseline
    • Deployed probers
    • Record success and latency metrics per request type
  5. Infer error budgets from initial SLOs
    • Example: 99.9% availability over 4 weeks -> 0.1% of requests may fail over those 4 weeks (worked through in the sketch below)
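
The arithmetic, worked through with made-up request counts:

```python
# Hypothetical traffic over a 4-week window.
total_requests = 10_000_000
failed_requests = 7_200

availability_sli = 100 * (1 - failed_requests / total_requests)  # 99.928%
slo_target = 99.9                                                # percent
error_budget = total_requests * (100 - slo_target) / 100         # 10,000 failures allowed
budget_remaining = error_budget - failed_requests                # 2,800 failures left

print(f"SLI: {availability_sli:.3f}%, error budget remaining: {budget_remaining:,.0f} requests")
```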

Conclusion

  • SLIs inform decisions for prioritizing reliability projects, doing capacity planning, etc
  • SLI graphs help identify service issues
  • Users easily determine whether our service is appropriate for a particular use case based on SLOs
  • Use SLIs for monitoring; no one needs to be paged while the service is within its SLO
  • Choose a metrics collection service with a powerful query language
  • Data durability SLO implementation for storage systems can be tricky

Tips for SLOs

  • Never strive for 100% reliability
  • Understand the components of the system
  • Know how users interact with the system
  • Collect SLIs that measure the aspects of the system that matter to users

Fixing On-Call When Nobody Thinks It's Too Broken by Tony Lykke of Hudson River Trading

Why so much noise?

  • That's how it's always been
  • 'Snowflake noise': special systems or integrations
  • We can't reduce noise unless we've got big corp money
  • It's better than it used to be

9 Really Hard Steps to Reduce Pager Noise

  1. Understand your audience
    • Consider why the team has its current attitude toward pages
  2. Understand the problem
    • Find the data
    • Look at your incident history in PagerDuty
    • Use graphs to help your data analysis (see the sketch after this list)
  3. Understand the system
    • What technologies are you using?
    • What does the code look like?
    • How is automation involved?
  4. Devise a Game Plan
    • This doesn't have to be comprehensive
    • Go after low-risk, high-impact changes first
    • Communicate the plan and ask for feedback
    • Listen to the data
  5. Get Permission (optional)
    • Ask for forgiveness instead of permission?
    • Use the data you've collected
    • Over-communicate
    • You will break things. Let the on-call person know what you're trying to do
  6. Lay the Groundwork
    • Neglect creates technical debt
    • Make your changes
    • Set up CI/CD
  7. Fix the Lowest Hanging Fruit
    • A data visualization may show you the low hanging fruit
  8. Communicate, Communicate, Communicate
    • Blog posts
    • RFCs
    • Documentation
    • Announcements
  9. Go Back to Step 7
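
For steps 2 and 7, a sketch of the data analysis (the column names are assumptions about whatever incident export you have, e.g. a PagerDuty CSV; adjust to your data):

```python
import pandas as pd

# Hypothetical export of incident history.
incidents = pd.read_csv("incidents.csv", parse_dates=["created_at"])

# Which alerts fire most? The top of this list is the low-hanging fruit.
noisiest = incidents.groupby("alert_name").size().sort_values(ascending=False)
print(noisiest.head(10))

# When do pages arrive? Off-hours spikes hurt the most.
print(incidents["created_at"].dt.hour.value_counts().sort_index())
```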

Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value by Aaron Wieczorek, USDS

  • The USDS was created after the redo of healthcare.gov
  • Currently, USDS steps in when there is a crisis
    • Example: airnow.gov during 2018 California Wildfires

How do we find these problems before they are a crisis?

  • Monitor every .gov service
  • There are ~25,000 services and apps for .gov and .mil

Custom solution as MVP

  • Scripts that send requests
  • The Python requests library plus a CLI (a minimal sketch follows)
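
A minimal sketch of that MVP (the file format and output are assumptions, not USDS's actual tooling):

```python
#!/usr/bin/env python3
"""Black-box prober MVP: request each endpoint, report status and latency."""
import sys

import requests

def probe(url):
    try:
        resp = requests.get(url, timeout=10)
        return url, resp.status_code, f"{resp.elapsed.total_seconds():.2f}s"
    except requests.RequestException as exc:
        return url, "ERR", exc.__class__.__name__

if __name__ == "__main__":
    # endpoints.txt: one URL per line, e.g. https://airnow.gov
    with open(sys.argv[1]) as f:
        for line in f:
            print("\t".join(str(x) for x in probe(line.strip())))
```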

To build out the monitoring

  • Prometheus, Grafana, and InfluxDB

Lessons Learned

  • Proactive monitoring allows immediate incident response
  • Sometimes targets don't like it when you send a lot of requests within 3-5 minutes
  • Dashboards with this many endpoints are hard
    • What kind of time-series data are you pulling down?
  • Alerting is hard
  • Tuning monitoring settings for a large system is hard

Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way by Michael Kehoe and Todd Palino of LinkedIn

When You've Declared Code Yellow

  • Problem Statement:
    • Admit there is a problem
    • Measure it
    • Understand it
    • Determine the underlying causes that need to be fixed
  • Exit Criteria
    • Define concrete goals
    • Define success criteria
    • Define timelines
  • Get the help you require
    • Ask other teams for help
    • Get dedicated engineers, PMs, etc.
    • Timebound
  • Planning
    • Plan out short-term work
    • Plan long-term projects
    • Prioritize work that will reduce toil and burnout
  • Communication and Partnerships
    • Communicate problem statement and exit criteria
    • Send regular progress updates
    • Ensure that stakeholders understand delays and expected outcomes

Create a Code Review Culture by Jonathan Turner of Squarespace

Code Reviews are Useful Because...

  • they ensure higher-quality code
  • serve as a communication platform
  • provide an opportunity to teach

Be intentional about your culture by...

  • explicitly describing what your culture entails
  • establishing a community of experts
  • developing new experts
  • training code reviewers

Advice for Code Authors

  • Make the reviewer's life easier by communicating as much context as you can
  • Establish your PR style with a PR guide
    • If no guide, write a good description
  • Make the PR a manageable size
    • What's the smallest vertical slice of functionality meaningful to your users?

Advice for Code Reviewers

  • Automate the nits
  • Know when to take the PR review offline
  • Communicate mutual respect
    • be as thorough as the PR needs
    • Review in passes. (Make a master PR review checklist)
  • John's PR Review Checklist:
    • Size it up (what's the shape of the PR, is the PR the right size?)
    • Context
      • What is the PR trying to accomplish?
      • Why is this PR trying to accomplish that?
      • Does the PR accomplish what it says?
    • Relevance
      • Is the change necessary?
      • Is code the right solution?
      • Are there other people that should be aware of this PR?
    • Readability
      • Can the change be understood without knowing the specific language?
      • Are any esoteric language features being used?
    • Production Readiness
      • How will we know when this breaks?
      • Is there new documentation required?
      • Are there tests that prevent regression?
      • Is the change secure?
    • Naming
      • Do names communicate what things do?
      • Are the names of things idiomatic to the language?
      • Do the names leak implementation details?
    • Gotchas
      • What are ways the code can break?
      • Is the code subject to any common programming gotchas?
      • Is spelling correct and consistent?
    • Language specific
      • Is the code well designed?
      • Is the code idiomatic to the language?
      • Are new patterns introduced?
      • Does the code fall prey to common pitfalls of the language?

Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance by Lynn Root of Spotify

Tracing Overview

  • A trace follows a complete workflow: the start of a request to its end
    • It's easy to trace a simple request
    • When you have services, there are a lot more places the data flows through
  • Historically, tracing has been machine-centric. We want workflow-centric tracing
  • Workflow-centric tracing lets you see dependencies

Why trace?

  • Performance analysis
  • Anomaly detection
  • Profiling (interested in just one component)
  • Resource attribution
  • Workload modeling
    • You can begin asking 'what if?' questions

Approaches to Tracing

  1. Manual
  2. Blackbox
  3. Metadata propagation (sketched after this list)
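
A sketch of metadata propagation with a head-based sampling decision baked in (the header names are invented for illustration; real systems use standards like W3C traceparent or B3):

```python
import random
import uuid

import requests

SAMPLE_RATE = 0.01  # head-based: decide once, at the start of the workflow

def start_trace():
    return {
        "X-Trace-Id": uuid.uuid4().hex,
        "X-Span-Id": uuid.uuid4().hex[:16],
        "X-Sampled": "1" if random.random() < SAMPLE_RATE else "0",
    }

def call_downstream(url, incoming_headers):
    # Propagate the trace id and sampling decision; mint a new span id per hop.
    headers = dict(incoming_headers)
    headers["X-Span-Id"] = uuid.uuid4().hex[:16]
    return requests.get(url, headers=headers, timeout=2.0)

# Example: an edge service starts the trace and passes it to a backend.
# call_downstream("https://inventory.example.com/stock", start_trace())
```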

Four Things to Think About

  1. What relationships will you track?
  2. How to track them
  3. Which sampling approach to take
  4. How to visualize

How to Sample

  • Head-based
    • Makes random sampling decisions at the beginning of the workflow
  • Tail-based
    • Makes decisions at the end of the workflow
  • Unitary

What to visualize?

  • Gantt charts only show requests from a single trace
  • Request flow graph
  • Context calling tree