Resilience Engineering

Kim Schlesinger edited this page Apr 21, 2020 · 1 revision

The Future of DevOps is Resilience Engineering

Amy Tobey at Failover Conf, 21 April 2020

Terms

Resilience Engineering

In the fields of engineering and construction, resilience is the ability to absorb or avoid damage without suffering complete failure and is an objective of design, maintenance and restoration for buildings and infrastructure, as well as communities.

  • Designing systems so that they can recover from failure

Socio-technical systems

  • A system created by people who leverage technology
  • Example: Daft Punk

Common Ground

  • When a group of people has a shared context that is communicated through shared language and rituals
  • Example: a jazz combo that can create music by calling jazz standards and applying musical keys and styles

Cognitive Capacity

  • How much thinking juice you have 😁
  • Spoon Theory is a way some disabled people describe cognitive capacity, and how the tasks of everyday living can deplete a disabled person's capacity faster than they deplete an able-bodied person's.

Joint cognitive systems

From the Flight Safety Foundation:

a system in which humans interact with machines and each other to maintain control of a safety-critical activity.

Adaptive Capacity

the ability of institutions and networks to learn, and store knowledge and experience

Resilience Engineering and DevOps

  • The cause of an outage is never human error. It is the environment and system that led a person to make the decision that caused the outage.
  • There is no such thing as a root cause. There is such a thing as the most likely reason an outage occurred.
  • We must learn from successes, not just failures. Don't just do post mortems; also study what happens when things are going well.

Recommended Reading:
