that is declarative at the affected resource level.

It's a sunny Wednesday afternoon, and the lead microservices developer for
Evil Genius Cupcakes is windsurfing. Work has been eating Jane alive for the
past two and a half weeks, but after successfully deploying version 3.6.0 of
the `baker` service that morning, she escaped early to try to unwind a bit.

Her shoulders are just starting to unknot when her phone pings with a text
from Julian, down in the NOC. Waterproof phones are a blessing, but also a
curse.

**Julian**: _Hey Jane. Things are still running, more or less, but latencies
on everything in the `baker` namespace are crazy high after your last rollout,
and `baker` itself has a weirdly high load. Sorry to interrupt you on the lake
but can you take a look? Thanks!!_

Jane stares at the phone for a long moment, heart sinking, then slowly tacks
back to shore to dry off and grab her laptop.

What she finds when she logs in is strange. `baker` is taking a _lot_ of load,
almost 4x what’s being reported by its usual clients, and its clients report
that calls are taking much longer than they’d expect. She double-checks
the Deployment, the Service, and all the HTTPRoutes around `baker`; everything
looks good. `baker`’s logs show her mostly failed requests... with a lot of
duplicates? Jane checks her HTTPRoute again, though she's pretty sure you
can't configure retries there, and finds nothing. But it definitely looks like
a client is retrying when it shouldn’t be.

She pings Julian.

A minute later he answers.

**Jane**: _Dude. I don’t even know how to._ 😂

**Julian**: _You just attach a RetryPolicy to your HTTPRoute._

**Jane**: _Nope. Definitely didn’t do that._

Minutes pass while both look at logs.

**Jane**: _OK, it’s definitely retrying. Nearly every request fails the first
few times, gets retried, and then finally succeeds?_

**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests..._

**Jane**: _Check both logs for request ID
6E69546E-3CD8-4BED-9CE7-45CD3BF4B889. `mixer` sends that once, but `baker`
RetryPolicy go in? Is it the only thing like it?_

**Julian**: _I didn’t look closely before deleting it, but I think it said a
few months ago. And there are lots of different kinds of policy and lots of
individual policies, hang on a minute..._

**Julian**: _Looks like about 47 for your chunk of the world, a couple hundred
system-wide._
can’t even_ look _at these things._ 😕

**Julian**: _That's gonna take a while. Our tooling to show us which policies
bind to a given workload doesn't go the other direction._

**Jane**: _...wait. You have to_ build tools _to figure out basic configuration??_

Pause.

One more look out at the lake.

The fundamental problem with policy attachment is that it **breaks the core
premise of Kubernetes as a declarative system**, because it’s not declarative:
it sets the world up for a sort of spooky action at a distance, to borrow
Einstein’s phrase. Policy attachment is not the only place where we see this
in Kubernetes, of course! but we submit that we shouldn't be adding any more
such places.
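
To make the spooky action concrete, here's a sketch of what a namespace-wide
retry policy might look like under the policy-attachment pattern. Note that
`RetryPolicy`, its API group, and its `retries` and `backoff` fields are all
hypothetical, invented purely for illustration; only the `targetRef` mechanism
comes from policy attachment itself:

```yaml
# Hypothetical resource, for illustration only: RetryPolicy is not a real
# Gateway API kind. Nothing in any HTTPRoute in the baker namespace mentions
# this object, yet it changes the behavior of every workload routed there.
apiVersion: policy.example.com/v1alpha1   # invented group and version
kind: RetryPolicy
metadata:
  name: retry-everything
  namespace: baker
spec:
  targetRef:        # policy attachment: point at the thing to be modified
    group: ""       # core API group
    kind: Namespace
    name: baker
  retries: 5        # hypothetical knobs
  backoff: 250ms
```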

Given that the fundamental problem is that policy attachment isn't
declarative as written and should be made declarative, there is only one
TODO: future iteration

Isn’t your parable really just showing us that Jane and Julian work for a
dysfunctional organization?_

**A**: As written, Evil Genius Cupcakes is _far_ from the most dysfunctional
organization I’ve seen. Jane and Julian support each other, neither casts
blame, both are clearly trying to do their best by the organization and their
customers even to their own cost. So the organization isn't really the
Someone suggests retries and they hastily slap in the CRD to enable them. The
post-mortem gets rescheduled a few times, and/or the person writing up the
timeline mistakenly notes that the retries were enabled for a given workload
rather than for the entire namespace, and no one ever figures out that error.
The post-mortem results in an action item of “fix this workload to not need
retries so we can turn retries off”, which goes into the backlog and gets
pushed down by more critical items.

That is a process problem for sure! but it's a sadly realistic one.

**Q**: _Okay, but in the real world, removing the RetryPolicy wouldn’t affect
every workload._

**A**: As soon as the namespace-wide RetryPolicy goes in, Jane’s team largely
loses the backstop of progressive rollout. As long as their workloads succeed
sometimes, progressive rollout has a good chance to succeed. After the few
months posited above, it’s not at all unlikely that every service will
actually be failing pretty often.

**Q**: _Fine. But in the real world, Jane would be able to see all the policy
objects herself, and this would be a non-issue._

**A**: Assuming permission to see everything necessary, please write me a
`kubectl` query to fetch every policy CRD that’s attached to an arbitrary
object. Remember to get policy CRDs attached to the enclosing namespace, too.

Challenging, no?

There’s a big difference between “having permission to see” and “being able to
effectively query and understand”. As policy attachment currently stands, you
couple of different ways that existing tooling isn't very good at.

**Q**: _Well then, in the real world, Jane would have access to higher-level
tools that know how to do that._

**A**: Those tools have yet to be written. Once they are, Jane and her team
will need to be taught that the tools exist and how to use them. From Jane’s
point of view, it's simpler not to need those tools: she'd rather just put the
right thing in her HTTPRoutes, and then be able to see them all when she reads
her HTTPRoutes.

**Q**: _What if we give Julian those tools? He could cope with them._

that.

**Q**: _Doesn't direct policy attachment make things better?_

**A**: Not really, no. Direct policy attachment is still spooky action at a
distance, so it doesn't make things markedly better.

(That said, direct policy attachment _does_ sidestep a specific very
unpleasant scenario that I considered but didn’t write about. In that one,
Julian tries to tweak the RetryPolicy to disable the retries for just the
`baker` workload, but runs afoul of an override installed by Jasmine from the
cluster-ops team, which Julian doesn’t have permission to even see... so he
has to infer the existence of the override he can't see, and he can't do
anything about it.)
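
For comparison, here's the same hypothetical policy using direct attachment,
targeting one specific HTTPRoute rather than the whole namespace (again, the
kind, group, and fields are invented for illustration):

```yaml
# Direct attachment: the hypothetical RetryPolicy targets a single
# HTTPRoute. The blast radius is smaller, but reading the HTTPRoute
# itself still tells you nothing about the policy.
apiVersion: policy.example.com/v1alpha1
kind: RetryPolicy
metadata:
  name: retry-baker-only
  namespace: baker
spec:
  targetRef:
    group: gateway.networking.k8s.io   # the real Gateway API group
    kind: HTTPRoute
    name: baker
  retries: 3
  backoff: 250ms
```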

**Q**: _OK, so isn’t this really just a retry thing? It’s not like all
policies can affect things so broadly._

**A**: To state the obvious, the whole point of policy attachment is to set
policy -- and by definition, policy has very broad capabilities. Retry is
actually a fairly _narrow_ function: suppose the attached policy was instead a
WAF which was intentionally applied on every namespace (gotta protect
everything!), and Jasmine mistakenly changed its configuration? That could
affect everything in the entire cluster – possibly only a week after Jasmine
made the change, when the WAF gets an update that interacts poorly with the
configuration change.

**Q**: _Dude, c’mon. That’s Jasmine and the WAF shooting themselves in the
foot, not a problem with policy attachment._

**A**: You’re right that policy attachment didn’t cause the retry issue we
looked at first, nor would it cause the WAF problem above. What we're
concerned about is that policy attachment _does_ make it much harder for Jane
to understand what's happening so that she can fix it. That will have a real
impact on real people.

**Q**: _So you're just saying that everything is impossible and you're not
listening to my questions._

**A**: Well, most of your "questions" aren't questions! 🙂

And we definitely think it's possible to do something about the situation;
that's what this proposal is all about.
