SRECon19 Day3

kimschles edited this page Mar 27, 2019 · 3 revisions

SRECon 2019 Day 3

Preparation

  • Problem solving is easier with constraints
    • Use the Dickerson Hierarchy of Reliability from the Google SRE book to decide on a constraint
  • Gagne's Hierarchy of Learning

Gaining Knowledge

  • Avoid 'The Illusion of Knowing'
    • Low-stakes testing
      • Frequent use of evaluation instruments
      • 'To learn, retrieve'
  • To learn, struggle
    • Delayed retrieval and interleaving
    • Try Leitner Boxes
    • Try memory palaces
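
The Leitner box idea mentioned above can be sketched in a few lines. This is only an illustration, not something shown in the talk: the three review intervals, the `Card` fields, and the function names are all assumptions chosen for the example. A correct answer moves a card to a box that is reviewed less often, and a miss sends it back to the first box, which is what forces delayed retrieval.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical review intervals in days for boxes 0, 1 and 2.
# Any increasing schedule would work; these numbers are an assumption.
INTERVALS = [1, 3, 7]

@dataclass
class Card:
    prompt: str
    box: int = 0       # index into INTERVALS; new cards start in the first box
    due_day: int = 0   # day number on which the card should next be reviewed

def review(card: Card, correct: bool, today: int) -> None:
    """Promote the card on a correct answer, send it back to box 0 on a miss."""
    if correct:
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        card.box = 0
    card.due_day = today + INTERVALS[card.box]

def due_cards(cards: list[Card], today: int) -> list[Card]:
    """Return the cards whose review day has arrived."""
    return [c for c in cards if c.due_day <= today]
```

Calling `due_cards` once a day and feeding each result through `review` gives the low-stakes, spaced testing the notes describe.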

Mental Models

  • Turn patterns and events into abstractions and stories
    • Events -> patterns -> structure
    • Observe -> reflect ->
    • Run incident reviews

Learning Together

  • mnemonic convergence
    • Cultural memory
  • Growth mindset
  • Psychological safety

This talk seeks to answer this question: How can an individual contributor influence availability at company-wide scale?

After a major incident, interview each person involved. At GitHub, they use the format below.

The 1:1 Incident Debrief

  • introductions and agenda
  • informed consent
    • let people know who this data will be shared with
  • Ask and record the answers to the following questions:
    • what was your role in the incident?
    • what surprised you?
    • how long did you work on the incident?
      • probe to see if people burned out during the incident
    • were you able to get the support you needed?
    • do you feel that the incident was preventable?
    • what actions do you feel good about?
    • what do you think could have been better?
    • what did you learn from this incident?
    • what do you think we can do to prevent reoccurrence?
    • did our tools and documentation serve you well?
    • did you practice self-care during this process?
      • this implies that you should take care of yourself while responding to an incident
    • can you think of anyone else we should talk to?

Nikolaus's Team

  • 40 people in 2 locations
  • Support ~400 Google services
  • Typical workload:
    • 30% interrupt work
    • 20% service maintenance
    • 50% project work
  • 2 on call rotations, several service ownership groups and project groups

Sublinear Scaling

  • Trying to achieve sublinear scaling is the difference between an SRE team and an ops team
  • Sublinear scaling is maintaining more services with proportionally fewer people
  • This is achieved with automation

Definitions of Automation

  • imperative vs. declarative automation
    • declarative means you specify the end-state you want
    • example of imperative automation: move service between datacenters
      • You understand the manual steps, and then automate them
        • Disadvantages:
          • you need a clean starting point
          • what do you do if your automation crashes?
          • you cannot change course while the automation is running
          • you need a separate recipe for each task and you duplicate steps
    • example of declarative automation: service should run in datacenter 'foo'
      • You declare the state that you want, not how to get there
        • Actuator
        • Sequencer (checks against the global ruleset)
      • This is more complex than a script
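
The contrast above can be made concrete with a minimal sketch of the declarative style. It is not the system described in the talk: the state is reduced to a dict mapping a service to a datacenter, and `plan`, `sequence`, `actuate`, `reconcile`, and the `no_moves_into_drained` rule are hypothetical names invented for this example. The sequencer filters planned steps against a ruleset and the actuator applies them.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]                 # service name -> datacenter it runs in
Step = Tuple[str, str]                 # (service, target datacenter)
Rule = Callable[[Step, State], bool]   # returns True if the step is allowed

def plan(desired: State, observed: State) -> List[Step]:
    """Diff the two states; every mismatch becomes one step."""
    return [(svc, dc) for svc, dc in desired.items() if observed.get(svc) != dc]

def sequence(steps: List[Step], observed: State, rules: List[Rule]) -> List[Step]:
    """Sequencer: keep only the steps that every rule in the ruleset allows."""
    return [s for s in steps if all(rule(s, observed) for rule in rules)]

def actuate(step: Step, observed: State) -> None:
    """Actuator: stand-in for the real move; here it only updates the dict."""
    svc, dc = step
    observed[svc] = dc

def reconcile(desired: State, observed: State, rules: List[Rule]) -> int:
    """Run one pass and return the number of steps applied."""
    steps = sequence(plan(desired, observed), observed, rules)
    for step in steps:
        actuate(step, observed)
    return len(steps)

def no_moves_into_drained(step: Step, observed: State) -> bool:
    """Hypothetical global rule: never place a service in the 'drained' datacenter."""
    return step[1] != "drained"
```

Because each pass starts from the observed state rather than from a fixed script position, a rerun after a crash simply picks up the remaining differences, and changing the desired state mid-flight changes what the next pass does. Those are the weaknesses listed for the imperative approach.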

What is automated?

  • Intent defines:
    • redundancy
    • proximities
    • dependencies
    • SLO
    • Rollout policy
    • service specific configuration (flags)
  • What is automated?
    • Starts, stops and moves services
    • Selects DC, replica count, resource requirements
    • Sets up monitoring and alerting
    • Runs load-tests
    • Releases new versions
    • Enforces policy (for example, you can't change prod on Friday afternoon)
  • The benefits of automation go to the right and down (they get worse)
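
As a rough illustration of what an intent record and a policy check might look like, the sketch below models the listed intent fields as a dataclass. Every field name and type here is an assumption made for the example. The policy is also an assumption: it treats the Friday example as a freeze on production changes from noon on Fridays.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List

@dataclass
class Intent:
    """Hypothetical shape for a service's declared intent."""
    service: str
    redundancy: int                                          # replica count to maintain
    proximities: List[str] = field(default_factory=list)     # services or users to stay close to
    dependencies: List[str] = field(default_factory=list)    # services this one needs
    slo_availability: float = 0.999                          # target fraction of successful requests
    rollout_policy: str = "canary"                           # placeholder policy name
    flags: Dict[str, str] = field(default_factory=dict)      # service-specific configuration

def change_allowed(now: datetime) -> bool:
    """Assumed freeze policy: block changes on Fridays from 12:00 onward."""
    is_friday = now.weekday() == 4   # Monday is 0, so Friday is 4
    return not (is_friday and now.hour >= 12)
```

An automation system could call a check like `change_allowed` before it starts, stops, moves, or releases anything, so the policy is enforced in one place instead of in every script.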

The goals of the 1k SRE Project

  • Automation stays the same
  • Many pages are handled automatically
  • Onboarding is done by developers
  • services as cattle, not pets
  • Automated incident handling

Summary

  • We doubled the number of supported services by increasing automation

Hybrid automation is a blend of human work and automated work, managed by checklists. Max's team was tasked with automating the creation of new GCP regions.

Two Challenges:

  1. What is the process to build a cloud location?
  2. Define and automate the processes to deploy and configure the systems

The Process

  • The effort to automate often has little payoff. It's often better for a person to do it manually.
  • Effective automation should amplify people
  • A lot of ideas about automation were described by Taiichi Ohno in Toyota Production System: Beyond Large-Scale Production
  • Automation itself is not the goal. You care about the outcome of the automated process.
  • Writing a checklist is the first step of automating (understand what the human does first)
  • Think about the interface
  • Make incremental steps
    • Again, start with checklists, then write code
  • Human Report Procedure Calls (use your issue tracking platform to track items on your checklists)
  • Google reused their Sisyphus release tool to run their checklists
  • Hybrid automation at scale
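
One way to picture the checklist-first approach is a runner where each step is either a function or a task handed to a person. The sketch below is an assumption-laden illustration, not the tooling from the talk: `Step`, `run_checklist`, and the `file_ticket` callback are hypothetical names, and the callback stands in for whatever issue tracker is in use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    name: str
    automated: Optional[Callable[[], None]] = None  # None means a person does this step
    done: bool = False

def run_checklist(steps: List[Step], file_ticket: Callable[[str], None]) -> List[str]:
    """Run automated steps in order and stop at the first pending manual step.

    A ticket is filed for the manual step so a person can pick it up.
    Returns the names of the steps completed during this call.
    """
    completed = []
    for step in steps:
        if step.done:
            continue
        if step.automated is None:
            file_ticket(step.name)   # hand the step to a human and wait
            break
        step.automated()
        step.done = True
        completed.append(step.name)
    return completed
```

Once the person marks their step as done, running the checklist again continues from that point. Replacing a manual step later only means filling in its `automated` field, which keeps delivery incremental.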

Summary

  • Automation isn't the goal, the impact is
  • Design for incremental delivery
  • Start with interfaces, checklists and SLOs
  • Amplify human judgement

Consider these questions about your company:

  • Where does SRE connect into the executive hierarchy?
  • Who defines reliability targets?
  • How do you (SRE) decide what to work on?
  • How do you engage with feature teams regarding planning future work?
  • What sorts of things would be “not SRE work” at your company? Who would own those things? Why (if you know)?
  • By whom and how does toil impact get controlled?
  • What is “success” for your SRE team?