- Provide new semantics to incorporate `Policy` resources at the level of the `Resource` that
will be affected.

## The Problem: A Parable of Jane

It's a sunny Wednesday afternoon, and the lead microservices developer for
Evil Genius Cupcakes is windsurfing. Work has been eating Jane alive for the
past two and a half weeks, but after successfully deploying version 3.6.0 of
the `baker` service that morning, she left early to try to unwind a bit.

Her shoulders are just starting to unknot when her phone pings with a text
from Julian, down in the NOC. Waterproof phones are a blessing, but also a
curse.

**Julian**: _Hey Jane. Things are still running, more or less, but latencies
on everything in the `baker` namespace are crazy high after your last rollout,
and `baker` itself has a weirdly high load. Sorry to interrupt you on the lake
but can you take a look? Thanks!!_

Jane stares at the phone for a long moment, then slumps and heads back to
shore to dry off and grab her laptop.

What she finds is strange. `baker` is taking a _lot_ of load, almost 4x what’s
being reported by its usual clients, and its clients report that calls are
taking much longer than they’d expect them to. She double-checks the
Deployment, the Service, and all the HTTPRoutes around `baker`; everything
looks good. `baker`’s logs show her mostly failed requests... with a lot of
duplicate requests? Jane checks her HTTPRoute again, though she's pretty sure
you can't configure retries there, and finds nothing. But it definitely looks
like a client is retrying when it shouldn’t be.

She pings Julian.

**Jane**: _Hey Julian. Something weird is up, looks like requests to `baker`
are failing but getting retried??_

A minute later he answers.

**Julian**: 🤷 _Did you configure retries?_

**Jane**: _Dude. I don’t even know how to._ 😂

**Julian**: _You have a RetryPolicy attached to your HTTPRoute?_

**Jane**: _Nope. Definitely didn’t do that._

She types `kubectl get retrypolicy -n baker` and gets a permission error.

**Jane**: _Huh, I actually don’t have permissions for RetryPolicy._ 🤔

**Julian**: 🤷 _Feels like you should but OK, guess that can’t be it._

Minutes pass while both look at logs.

**Jane**: _OK, it’s definitely retrying. Nearly every request fails the first
few times, gets retried, and then finally succeeds?_

**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests…_

**Jane**: _Check both logs for request ID
6E69546E-3CD8-4BED-9CE7-45CD3BF4B889. `mixer` sends that once, but `baker`
shows it arriving four times in quick succession. Only the fourth one
succeeds. That has to be retries._

Another pause.

**Julian**: _I’m an idiot. There’s a RetryPolicy for the whole namespace –
sorry, too many policies in the dashboard and I missed it. Deleting that since
you don’t want retries._

**Jane**: _Are you sure that’s a good–_

Jane’s phone shrills while she’s typing, and she drops it. When she picks it
up again she sees a stack of alerts. Quickly flipping through them, she feels
the blood drain from her face: there’s one for every single service in the
`baker` namespace.

**Jane**: _PUT IT BACK!!_

**Julian**: _Just did. Be glad you couldn't hear all the alarms here._ 😕

**Jane**: _What the hell just happened??_

**Julian**: _At a guess, all the workloads in the `baker` namespace actually
fail a lot, but they seem OK because there are retries across the whole
namespace?_ 🤔

Jane’s jaw drops.

**Jane**: _You’re saying that ALL of our services are broken??!_

**Julian**: _That’s what it looks like. Guessing your `baker` rollout would
have failed without retries turned on._

There is a pause while Jane thinks through increasingly unpleasant possibilities.

**Jane**: _I don't even know where to start here. How long ago did that
RetryPolicy go in? Is it the only thing like it?_

**Julian**: _I didn’t look closely before deleting it, but I think it said a
few months ago. And there are lots of different kinds of policy and lots of
individual policies, hang on a minute…_

**Julian**: _Looks like about 47 for your chunk of the world, a couple hundred
system-wide._

**Jane**: 😱 _Can you tell me what they’re doing for each of our services? I
can’t even_ look _at these things._ 😕

**Julian**: _That's gonna take a while. Our tooling can show what a given
policy binds to, but not which policies bind to a given workload._

**Jane**: _…Wait. You have to_ build tools _to figure out basic configuration??_

Pause.

**Julian**: _Policy attachment is more complex than we’d like, yeah._ 😐
_Look, how ‘bout rolling back your `baker` change for now? We can get
together in the morning and start sorting this out._

Jane shakes her head and rolls back her edits to the `baker` Deployment, then
sits looking out over the lake as the deployment progresses.

**Jane**: _Done. Are things happier now?_

**Julian**: _Looks like, thanks. Reckon you can get back to your sailboard._ 🙂

Jane sighs.

**Jane**: _Wish I could. Wind’s died down, though, and the sun is almost gone.
May as well head home._

One more look out at the lake.

**Jane**: _Thanks for the help. Wish we’d found better answers._ 😢

## The Proposal

The fundamental problem with policy attachment is that it **breaks the core
premise of Kubernetes as a declarative system**, because it’s not declarative:
it sets the world up for a sort of spooky action at a distance, to borrow
Einstein’s phrase. We acknowledge that policy attachment is not the only place
where we see this in Kubernetes, of course, but we submit that we should
probably not be adding more such places.

Given that the fundamental problem is that policy attachment as written isn’t
declarative, there is only one real answer: make it declarative. That means
modifying the Kubernetes core resources to include extension points where a
given object refers to its modifiers, rather than having the modifying
resource try to attach itself to its target. This is an ugly job, but it’s the
only way to deal with this situation.
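
To make the direction of reference concrete, here is a minimal sketch. The
`RetryPolicy` schema, the `policies` field on HTTPRoute, and all of the names
here are illustrative assumptions, not proposed API (the API section below is
still TODO):

```yaml
# Today: the policy points at its target, and the target never knows.
apiVersion: policy.example.dev/v1alpha1    # hypothetical policy CRD
kind: RetryPolicy
metadata:
  name: baker-retries
  namespace: baker
spec:
  targetRef:                 # the modifier attaches itself to the resource
    kind: HTTPRoute
    name: baker
  retries: 3
---
# Under this proposal: the resource declares its own modifiers.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: baker
  namespace: baker
spec:
  parentRefs:
    - name: example-gateway  # hypothetical Gateway
  policies:                  # hypothetical extension point, not real API
    - kind: RetryPolicy
      name: baker-retries
  rules:
    - backendRefs:
        - name: baker
          port: 8080
```

With the reference reversed, `kubectl get httproute baker -o yaml` shows
everything that modifies the route, and Jane’s existing RBAC on HTTPRoutes is
all she needs to see it.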

This GEP proposes to start this process with the Gateway API resources.

## API

TODO: future iteration

## Questions and Answers

**Q**: _Why are you implying that there’s a problem with policy attachment?
Isn’t your parable really just showing us that Jane and Julian work for a
dysfunctional organization?_

**A**: As written, Evil Genius Cupcakes is far from the most dysfunctional
organization I’ve seen. Jane and Julian support each other, neither casts
blame, and both are clearly trying to do their best by the organization and
its customers, even at their own cost. So the organization isn't really the
problem.

**Q**: _No organization would actually install a namespace-wide retry policy
and then forget about it, though._

**A**: I literally cannot even begin to count the number of times I’ve seen
something like this happen.

The most common scenario goes like this: it’s 8PM on a Friday and something
goes wrong. There is much screaming, wailing, and gnashing of teeth as the
on-call staff try to figure out what’s up. Inevitably, the SME is on vacation.
Someone suggests retries, and they hastily slap in the CRD to enable them. The
post-mortem gets rescheduled a few times, and/or the person writing up the
timeline mistakenly notes that the retries were enabled for a given workload
rather than for the entire namespace, and no one ever catches the error. The
action item to “fix this workload to not need retries” goes into the backlog
and gets pushed down by more critical items.

**Q**: _Okay, but in the real world, removing the RetryPolicy wouldn’t affect
every workload._

**A**: As soon as the namespace-wide RetryPolicy goes in, Jane’s team largely
loses the backstop of progressive rollout: as long as their workloads don’t
fail 100% of the time, progressive rollout will likely succeed anyway. After a
few months, it’s entirely plausible that every service is failing fairly
often.

**Q**: _Fine. But in the real world, Jane would be able to see all the policy
objects herself, and this would be a non-issue._

**A**: Quick, write me a kubectl query to fetch every policy CRD that’s
attached to an arbitrary object. Go ahead. I’ll wait. Make sure you get policy
CRDs attached to the enclosing namespace, too.
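
For a sense of what that entails today, here is a hedged sketch. It assumes
each policy kind records its attachment point in `spec.targetRef` (a
convention, not something the API server indexes), and the
`retrypolicies.policy.example.dev` kind is hypothetical:

```bash
# Sketch only: there is no single query for "all policies attached to X".

# 1. Discover which policy kinds even exist in this cluster. There is no
#    guaranteed marker, so a name grep is about the best generic guess.
kubectl get crds -o name | grep -i policy

# 2. For each kind found, filter on targetRef by hand -- once for policies
#    aimed at the HTTPRoute itself...
kubectl get retrypolicies.policy.example.dev -n baker -o json |
  jq '.items[] | select(.spec.targetRef.kind == "HTTPRoute"
                        and .spec.targetRef.name == "baker")'

# 3. ...and again for policies aimed at the enclosing namespace, the
#    Gateway, the GatewayClass, and anything carrying defaults or overrides.
```

Every step needs per-kind knowledge, and none of it is visible from the
HTTPRoute itself; that is exactly Jane’s problem.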


There’s a big difference between “having permission to see” and “being able to
effectively query and understand”. As policy attachment currently stands, you
need to be able to query many different kinds of CRDs _and_ filter them in a
couple of different ways that existing tooling isn't very good at.

**Q**: _Well then, in the real world, Jane would have access to higher-level
tools that know how to do that._

**A**: Those tools need to be written, and Jane and her team need to be taught
that the tools exist and how to use them. From Jane’s point of view, those
tools are adding friction to her job, and honestly she’s right: why should she
need to learn funky new tools instead of just putting the right thing in her
HTTPRoutes?

**Q**: _What if we give Julian those tools? He could cope with them._

**A**: Sure, but now you’re back to a world in which Jane isn’t
self-sufficient and has to bottleneck on Julian. Neither of them will like
that.

**Q**: _Doesn't direct policy attachment make things better?_

**A**: Not really, no. The only real effect is that if you use direct policy
attachment, you can’t land in a scenario that I considered but didn’t write
about: in that one, Julian tries to tweak the RetryPolicy to disable the
retries for `baker` alone, but runs afoul of an override installed by Jasmine
from the cluster-ops team, which Julian doesn’t have permission to change… so
he literally can’t even turn them off.

**Q**: _OK, so isn’t this really just a retry thing? It’s not like all
policies can affect things so broadly._

**A**: Stating the obvious here: the whole point of policy attachment is to
set policy, and by definition, policy has very broad capabilities. Retry is
actually a fairly narrow function. Suppose the attached policy were a WAF,
intentionally applied to every namespace (gotta protect everything!), and
Jasmine mistakenly changed its configuration? That could affect everything in
the entire cluster – possibly only a week after Jasmine made the change, when
the WAF gets an update that interacts poorly with the configuration change.

**Q**: _Dude, c’mon. That’s Jasmine and the WAF shooting themselves in the
foot, not a problem with policy attachment._

**A**: You’re right that policy attachment didn’t cause the retry issue we
looked at first, nor would it cause the WAF problem above. But it does make it
much harder for Jane (the human directly affected) to understand what’s
happening so she can fix it. That’s the problem that I’m concerned about.

**Q**: _So you’re saying this is just impossible then, and you’re not
listening to anything I ask._

**A**: Well, most of your questions aren’t questions! But more importantly,
see the next section.
