SRECon19 Day3

SRECon 2019 Day 3

Optimizing for Learning by Logan McDonald of BuzzFeed

Github Gist of Resources
Live Tweets
Logan studied Behavioral Economics
Senior engineers have 'expert intuition'
- We can develop expert intuition

Preparation

Problem solving is easier with constraints
- Use the Dickerson Hierarchy of Site Reliability from the Google SRE Handbook to decide on a constraint
Gagne's Hierarchy of Learning

Gaining Knowledge

Avoid 'The Illusion of Knowing'
- Low-stakes testing
  - Frequent use of evaluation instruments
  - 'To learn, retrieve'
To learn, struggle
- Delayed retrieval and interleaving
- Try Leitner Boxes
- Try memory palaces
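A minimal sketch of how Leitner boxes schedule delayed retrieval. The three-box layout and the review intervals are assumptions for illustration, not values from the talk: a correct answer promotes a card to a less frequently reviewed box, and a miss sends it back to box 1.

```python
from dataclasses import dataclass


@dataclass
class Card:
    prompt: str
    answer: str
    box: int = 1  # every card starts in the most frequently reviewed box


# Hypothetical schedule: box N is reviewed every REVIEW_EVERY[N] sessions.
REVIEW_EVERY = {1: 1, 2: 2, 3: 4}
LAST_BOX = max(REVIEW_EVERY)


def due_cards(cards, session):
    """Return the cards whose box is scheduled for this session number."""
    return [c for c in cards if session % REVIEW_EVERY[c.box] == 0]


def record_answer(card, correct):
    """Promote a card on a correct answer, send it back to box 1 on a miss."""
    card.box = min(card.box + 1, LAST_BOX) if correct else 1
    return card
```

Cards that keep getting missed stay in box 1 and come up every session, which is where the "struggle" happens.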

Mental Models

Turn patterns and events into abstractions and stories
- Events -> patterns -> structure
- Observe -> reflect ->
- Run incident reviews

Learning Together

mnemonic convergence
- Cultural memory
Growth mindset
Psychological safety

One on One SRE by Amy Tobey of GitHub

Live Tweets

This talk seeks to answer this question: How can an individual contributor influence availability at company-wide scale?

After a major incident, interview each person involved. At GitHub, they use the format below.

The 1:1 Incident Debrief

introductions and agenda
informed consent
- let people know who this data will be shared with
Ask and record the answers to the following questions:
- what was your role in the incident?
- what surprised you?
- how long did you work on the incident?
  - probe to see if people burned out during the incident
- were you able to get the support you needed?
- do you feel that the incident was preventable?
- what actions do you feel good about?
- what do you think could have been better?
- what did you learn from this incident?
- what do you think we can do to prevent reoccurrence?
- did our tools and documentation serve you well?
- did you practice self-care during this process?
  - this implies that you should take care of yourself while responding to an incident
- can you think of anyone else we should talk to?

The 1k SRE Project: Sublinear Scaling in Practice by Nikolaus Rath of Google

Nikolaus's Team

40 people in 2 locations
Support ~400 Google services
Typical workload:
- 30% interrupt work
- 20% service maintenance
- 50% project work
2 on-call rotations, several service ownership groups and project groups

Sublinear Scaling

Trying to achieve sublinear scaling is the difference between an SRE team and an ops team
Sublinear scaling means supporting more services without growing the team at the same rate
This is achieved with automation
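One way to make "sublinear" concrete. The power-law form below is an illustrative assumption, not something stated in the talk: S is the number of supported services, H(S) is the headcount needed to support them, and c and α are hypothetical constants.

```latex
H(S) = c \cdot S^{\alpha}, \qquad 0 \le \alpha < 1
\quad\Longrightarrow\quad
\frac{H(2S)}{H(S)} = 2^{\alpha} < 2
```

Under that assumption, doubling the number of services less than doubles the people needed. An ops team that adds people in proportion to services corresponds to α = 1.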

Definitions of Automation

imperative vs. declarative automation
- declarative means you specify the end-state you want
- example of imperative automation: move service between datacenters
  - You understand the manual steps, and then automate them
    - Disadvantages:
      - you need a clean starting point
      - what do you do if your automation crashes?
      - you cannot change course while the automation is running
      - you need a separate recipe for each task and you duplicate steps
- example of declarative automation: service should run in datacenter 'foo'
  - You declare the state that you want, not how to get there
    - Actuator
    - Sequencer (checks against the global ruleset)
  - This is more complex than a script
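A minimal sketch of the declarative style, assuming a toy state of just a datacenter and a replica count. `ServiceState`, `plan`, `actuate` and `reconcile` are hypothetical names, and the sequencer's check against a global ruleset is left out. The plan is recomputed from whatever the current state is, so there is no need for a clean starting point, and rerunning after a crash picks up where things stand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    datacenter: str
    replicas: int


def plan(current, desired):
    """Sequencer: diff the current state against the intent and order the steps."""
    steps = []
    if current.datacenter != desired.datacenter:
        # Bring up the new location before stopping the old one.
        steps.append(("start", desired.datacenter, desired.replicas))
        steps.append(("stop", current.datacenter, current.replicas))
    elif current.replicas != desired.replicas:
        steps.append(("resize", desired.datacenter, desired.replicas))
    return steps


def actuate(current, step):
    """Actuator: apply one step and return the resulting state."""
    action, datacenter, replicas = step
    if action in ("start", "resize"):
        return ServiceState(datacenter, replicas)
    return current  # "stop" removes the old copy; the tracked state is unchanged


def reconcile(current, desired):
    """Re-plan from the current state until it matches the declared intent."""
    while current != desired:
        for step in plan(current, desired):
            current = actuate(current, step)
    return current
```

Usage: `reconcile(ServiceState("bar", 3), ServiceState("foo", 5))` declares "the service should run in datacenter 'foo' with 5 replicas" and leaves the steps to the sequencer.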

What is automated?

Intent defines:
- redundancy
- proximities
- dependencies
- SLO
- Rollout policy
- service specific configuration (flags)
What is automated?
- Starts, stops and moves service
- Selects DC, replica count, resource requirements
- Setup monitoring and alerting
- Runs load-tests
- Releases new versions
- Enforces policy (for example, you cannot change prod on Friday afternoon)
The benefits of automation go to the right and down (they get worse)
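A sketch of what automated policy enforcement could look like, assuming the policy is a freeze on production changes on Friday afternoons. The function name and the noon cutoff are hypothetical, and the talk did not show how Google implements this.

```python
from datetime import datetime

FREEZE_WEEKDAY = 4      # Monday is 0, so 4 is Friday
FREEZE_START_HOUR = 12  # assumed start of "afternoon"


def change_allowed(when):
    """Return False for production changes requested on a Friday afternoon."""
    in_freeze = when.weekday() == FREEZE_WEEKDAY and when.hour >= FREEZE_START_HOUR
    return not in_freeze
```

A rollout tool would call `change_allowed(datetime.now())` before touching production and refuse to proceed when it returns `False`.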

The goals of 1k SRE Project

Automation stays the same
Many pages are handled automatically
Onboarding is done by developers
Services as cattle, not pets
Automated incident handling

Summary

We doubled the number of supported services by increasing automation

Pragmatic Automation by Max Luebbe of Google

Live Tweets

Hybrid automation is a blend of human work and automated work, managed by checklists.
Max's team was tasked with automating the creation of new GCP regions.

Two Challenges:

What is the process to build a cloud location?
Define and automate the processes to deploy and configure the systems

The Process

The effort to automate often has little payoff. It's often better for a person to do it manually.
Effective automation should amplify people
A lot of ideas about automation were described by Taiichi Ohno in Toyota Production System: Beyond Large-Scale Production
Automation itself is not the goal. You care about the outcome of the automated process.
Writing a checklist is the first step of automating (understand what the human does first)
Think about the interface
Make incremental steps
- Again, start with checklists, then write code
Human Remote Procedure Calls (use your issue tracking platform to track items on your checklists)
Google reused their Sisyphus release tool to run their checklists
Hybrid automation at scale
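A minimal sketch of a hybrid checklist runner, assuming each step is either a function or a task that a human still does. `Step`, `run_checklist` and the `file_ticket` callback are hypothetical names and not the interface of any Google tool. A real runner would also wait for the ticket to be closed before moving on.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Optional[Callable[[], None]] = None  # None means a human still does this step


def run_checklist(steps, file_ticket):
    """Run automated steps in order; hand manual steps to a human via a ticket."""
    log = []
    for step in steps:
        if step.run is None:
            file_ticket(step.name)
            log.append((step.name, "human"))
        else:
            step.run()
            log.append((step.name, "automated"))
    return log
```

Automating incrementally then means replacing one `run=None` step at a time with code, while the checklist and its interface stay the same.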

Summary

Automation isn't the goal, the impact is
Design for incremental delivery
Start with interfaces, checklists and SLOs
Amplify human judgement

Exploring SRE Differences Across Companies by Kurt Andersen of LinkedIn

This talk was a conversation with the audience
Notes from the discussion

Consider these questions about your company:

Where does SRE connect into the executive hierarchy?
Who defines reliability targets?
How do you (SRE) decide what to work on?
How do you engage with feature teams regarding planning future work?
What sorts of things would be “not SRE work” at your company? Who would own those things? Why (if you know)?
By whom and how does toil impact get controlled?
What is “success” for your SRE team?
