---
layout: post
title: "Human Error and Getting off the Hook"
---

## [{{ page.title }}]({{ page.url }})
I've been interested in the field of human error and system safety for a while
now. My interest was originally sparked by talking to [John Allspaw][allspaw]
and ultimately by reading Sidney Dekker's [The Field Guide to Understanding
Human Error][fieldguide], which gives a very good introduction to the topic.
The book offers a lot of examples of things that have gone wrong - often in
aviation - and even though I read most of it on a 14-hour flight, I can
definitely recommend it. Since then I have participated in a book club about
the Field Guide and completed an informal course, taught by John, on how to
facilitate a [blameless postmortem][postmortem]. This approach of figuring out
what happened and why it happened, in a blameless manner, makes a lot of sense
to me. I have worked in more traditional places before, where incidents
weren't investigated in such a way, and I always had the feeling that
something was missing - that the full story was never really uncovered, not
even close. Over time I have talked to quite a lot of people about the New
View, this way of thinking about what contributes to incidents and of
investigating them blamelessly. When I talk to people who are new to the topic
or have never heard of it before, one question comes up very quickly, and it's
usually something along the lines of:

> But isn't this just a cheap way of getting off the hook?

This is why I decided to write down my thoughts on why that isn't the case,
and on what the New View actually says about responsibility and about
preventing the same incident from happening again.

First let's look at what we are working with every day, be it air traffic
control and flying airplanes, operating modern trains, taking care of patients
in a hospital, or keeping a website running. All of these are complex
socio-technical systems: as a whole they consist of many, many technological
parts, with humans as operators interacting with them. And they are large and
complex enough to be intractable for any single person. At no point is there a
simple, clear plan to follow, and at no point can anyone fully describe the
system end to end with all of its interactions. So anyone working within the
system constantly has to choose between checking every single step for every
risk imaginable and actually getting work done - something Erik Hollnagel
calls the Efficiency-Thoroughness Trade-Off, or [ETTO][etto]. For example, an
airplane pilot might speed through the pre-takeoff checklist because being
extremely thorough would almost certainly introduce delays or even cost the
plane its flight slot. A doctor may read only the part of the patient report
that is relevant to the immediate treatment or surgery, because of the huge
number of patients they have to take care of. A software engineer wants to
make something faster to provide a better experience for the user, and
subsequently brings down the site by exhausting the available resources too
quickly.
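
To make that last example a bit more concrete, here is a minimal, hypothetical
sketch of how such a trade-off can look in code. The function name, the `db`
handle and its `query` method are made up for illustration and not taken from
any real incident, but it shows a well-intentioned "make it faster" change
that only becomes a problem once a shared resource runs out.

```python
# Hypothetical sketch: a speed-up that leans on a shared resource.
from concurrent.futures import ThreadPoolExecutor

def fetch_recommendations(user_ids, db):
    # Before: sequential and slow, but never more than one query in flight.
    #   return [db.query(uid) for uid in user_ids]

    # After: much faster for the user, but every worker now grabs a database
    # connection at the same time. Under normal traffic this looks like a
    # clear win; during a traffic spike the shared connection pool is
    # exhausted and unrelated parts of the site start failing as well.
    with ThreadPoolExecutor(max_workers=max(1, len(user_ids))) as pool:
        return list(pool.map(db.query, user_ids))
```

At the moment the change is made, this is a perfectly reasonable
efficiency-thoroughness trade-off - which is exactly the point.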

Now these very specific examples might sound like people slacking off: if they
just did everything according to the rules and regulations, everything would
be fine and nothing could go wrong. And conversely, if we fire the person who
deviated from the rules, we have a perfectly simple explanation for why our
complex system broke. This is a very natural approach to accident
investigation - even Nietzsche described the tendency behind it:

> In the search for a cause of an accident we do tend to stop, in the words of
> Nietzsche, by ‘the first interpretation that explains the unknown in familiar
> terms’ and ‘to use the feeling of pleasure … as our criterion for truth.’
>
><p class="cite">
> &mdash; <cite>Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off (p. 10)</cite>
></p>

The reality, however, is that in complex, intractable systems it's impossible
to follow every rule and to attend to work with 100% thoroughness. There is
even a form of industrial action called ["work-to-rule"][worktorule], in which
employees do exactly what the rules describe and nothing more - and thereby
cause a slowdown that can come close to a complete stop.

So now that we have established that people make these shortcuts and
thoroughness trade-offs all the time - because they are what makes sense at
the time - we can safely assume that other operators do, or would do, the
same. That brings us to a point where actions that a moment ago looked like
the cause of the trouble now look like the natural behaviour of people working
within the system. As Sidney Dekker puts it so aptly:

> Indeed, as soon as you have reason to believe that any other practitioner
> would have done the same thing as the one whose assessments and actions are
> now controversial, you should start looking at the system.
>
><p class="cite">
> &mdash; <cite>Sidney Dekker, The Field Guide to Understanding Human Error (p. 195)</cite>
></p>

But is that really true? Maybe all the others who had done this very thing a
dozen times before simply had better judgement? Maybe they were just more
aware of the situation and recognized that it was an appropriate response,
whereas in the failure case the operator failed to recognize that now was not
the time to do it. It turns out smart people have thought about this very
thing before. The Austrian physicist and philosopher Ernst Mach came to [this
conclusion in 1905][ernstmach]:

> Erkenntnis und Irrtum fließen aus denselben psychischen Quellen; nur der
> Erfolg vermag beide zu scheiden. Der klar erkannte Irrtum ist als Korrektiv
> ebenso erkenntnisfördernd wie die positive Erkenntnis.
>
><p class="cite">
> &mdash; <cite>Ernst Mach, Erkenntnis und Irrtum (p. 116)</cite>
></p>

This translates to something like: "Knowledge and error flow from the same
mental sources; only success can tell the one from the other. A clearly
recognized error, as a corrective, furthers knowledge just as much as positive
knowledge does." That makes it clear that the judgement of whether something
was "the right thing to do" or an error is the product of post-hoc analysis of
the situation - an advantage the operator in the situation didn't have.

So now what? Our precious theory about the bad apple is gone. Where do we go
from here? Are we not allowed to talk about human actions at all anymore?

Quite the opposite.

Humans are a crucial part of socio-technical systems - and they are what makes
those systems safe. As we said before, our complex systems are in large part
intractable, so there is no way we could design a set of rules for a machine
to execute and have everything be safe. The thousands of little adjustments
that human operators carry out every minute are the pillars of our systems'
safety - with the small catch that they sometimes also lead to adverse
outcomes. And that is the part we are interested in: how does something that
is done over and over again lead, seemingly out of nowhere, to an incident?
Why did it make sense for a person in that situation to act the way they did?
After all, the basic assumption is that people don't come to work to do a bad
job.

> There is a difference between explaining and excusing human performance.
>
><p class="cite">
> &mdash; <cite>Sidney Dekker, The Field Guide to Understanding Human Error (p. 196)</cite>
></p>

So does the human operator get off the hook in this New View? The answer is
no - because thinking about failure and outages in this way means the
practitioner was never on the hook for having to defend their behaviour in the
first place. However, being at the very sharp end of an incident brings new
responsibilities with it. The human is now the specialist with the most
knowledge about how the system surprised us and broke down. They know best how
they expected the system to react and what it actually did. They are the
foremost authority on which signals they relied on to detect the problem and
on what could be put in place to notice sooner that something is going wrong.
They know which tools they reached for, which they had to improvise, and which
were missing. So they very much are on the hook - but on the hook for helping
to find ways to make the system safer going forward.

If this has sparked your interest in the field, my coworker [Ian][indec] has
recently published a set of resources on the [Etsy Engineering
blog][justculture] for getting started with System Safety, Human Error and
Just Culture.




[fieldguide]: http://amzn.com/0754648265
[etto]: http://amzn.com/B009KOA6LA
[allspaw]: http://www.kitchensoap.com/
[postmortem]: http://codeascraft.com/2012/05/22/blameless-postmortems/
[worktorule]: http://en.wikipedia.org/wiki/Work-to-rule
[justculture]: http://codeascraft.com/2014/07/18/just-culture-resources/
[ernstmach]: https://archive.org/download/erkenntnisundirr00machuoft/erkenntnisundirr00machuoft.pdf
[indec]: https://twitter.com/indec
