that is declarative at the affected resource level.

It's a sunny Wednesday afternoon, and the lead microservices developer for
Evil Genius Cupcakes is windsurfing. Work has been eating Jane alive for the
past two and a half weeks, but after successfully deploying version 3.6.0 of
the `baker` service that morning, she escaped early to try to unwind a bit.

Her shoulders are just starting to unknot when her phone pings with a text
from Julian, down in the NOC. Waterproof phones are a blessing, but also a
curse.

**Julian**: _Hey Jane. Things are still running, more or less, but latencies
on everything in the `baker` namespace are crazy high after your last rollout,
and `baker` itself has a weirdly high load. Sorry to interrupt you on the lake
but can you take a look? Thanks!!_

Jane stares at the phone for a long moment, heart sinking, then slowly tacks
back to shore to dry off and grab her laptop.

What she finds when she logs in is strange. `baker` is taking a _lot_ of load,
almost 4x what’s being reported by its usual clients, and its clients report
that calls are taking much longer than they’d expect. She double-checks
the Deployment, the Service, and all the HTTPRoutes around `baker`; everything
looks good. `baker`’s logs show her mostly failed requests... with a lot of
duplicates? Jane checks her HTTPRoute again, though she's pretty sure you
can't configure retries there, and finds nothing. But it definitely looks like
a client is retrying when it shouldn’t be.

She pings Julian.

A minute later he answers.

**Jane**: _Dude. I don’t even know how to._ 😂

**Julian**: _You just attach a RetryPolicy to your HTTPRoute._

**Jane**: _Nope. Definitely didn’t do that._

Minutes pass while both look at logs.

**Jane**: _OK, it’s definitely retrying. Nearly every request fails the first
few times, gets retried, and then finally succeeds?_

**Julian**: _Are you sure? I don’t see the `mixer` client making duplicate requests..._

**Jane**: _Check both logs for request ID
6E69546E-3CD8-4BED-9CE7-45CD3BF4B889. `mixer` sends that once, but `baker`
RetryPolicy go in? Is it the only thing like it?_

**Julian**: _I didn’t look closely before deleting it, but I think it said a
few months ago. And there are lots of different kinds of policy and lots of
individual policies, hang on a minute..._

**Julian**: _Looks like about 47 for your chunk of the world, a couple hundred
system-wide._
can’t even_ look _at these things._ 😕

**Julian**: _That's gonna take a while. Our tooling to show us which policies
bind to a given workload doesn't go the other direction._

**Jane**: _...wait. You have to_ build tools _to figure out basic configuration??_

Pause.

One more look out at the lake.

The fundamental problem with policy attachment is that it **breaks the core
premise of Kubernetes as a declarative system**, because it’s not declarative:
it sets the world up for a sort of spooky action at a distance, to borrow
Einstein’s phrase. Policy attachment is not the only place where we see this
in Kubernetes, of course! but we submit that we shouldn't be adding any more
such places.
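
To make the spooky action concrete, here's a sketch of what a namespace-wide
retry policy might look like under the policy-attachment pattern. Note that
`RetryPolicy`, its API group, and its `retries` and `backoff` fields are all
hypothetical, invented purely for illustration; only the `targetRef` mechanism
comes from policy attachment itself:

```yaml
# Hypothetical resource, for illustration only: RetryPolicy is not a real
# Gateway API kind. Nothing in any HTTPRoute in the baker namespace mentions
# this object, yet it changes the behavior of every workload routed there.
apiVersion: policy.example.com/v1alpha1   # invented group and version
kind: RetryPolicy
metadata:
  name: retry-everything
  namespace: baker
spec:
  targetRef:        # policy attachment: point at the thing to be modified
    group: ""       # core API group
    kind: Namespace
    name: baker
  retries: 5        # hypothetical knobs
  backoff: 250ms
```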

Given that the fundamental problem is that policy attachment isn't
declarative as written and should be made declarative, there is only one
TODO: future iteration

Isn’t your parable really just showing us that Jane and Julian work for a
dysfunctional organization?_

**A**: As written, Evil Genius Cupcakes is _far_ from the most dysfunctional
organization I’ve seen. Jane and Julian support each other, neither casts
blame, both are clearly trying to do their best by the organization and their
customers even to their own cost. So the organization isn't really the
Someone suggests retries and they hastily slap in the CRD to enable them. The
post-mortem gets rescheduled a few times, and/or the person writing up the
timeline mistakenly notes that the retries were enabled for a given workload
rather than for the entire namespace, and no one ever figures out that error.
The post-mortem results in an action item of “fix this workload to not need
retries so we can turn retries off”, which goes into the backlog and gets
pushed down by more critical items.

That is a process problem for sure! but it's a sadly realistic one.

**Q**: _Okay, but in the real world, removing the RetryPolicy wouldn’t affect
every workload._

**A**: As soon as the namespace-wide RetryPolicy goes in, Jane’s team largely
loses the backstop of progressive rollout. As long as their workloads succeed
sometimes, progressive rollout has a good chance to succeed. After the few
months posited above, it’s not at all unlikely that every service will
actually be failing pretty often.

**Q**: _Fine. But in the real world, Jane would be able to see all the policy
objects herself, and this would be a non-issue._

**A**: Assuming permission to see everything necessary, please write me a
`kubectl` query to fetch every policy CRD that’s attached to an arbitrary
object. Remember to get policy CRDs attached to the enclosing namespace, too.

Challenging, no?

There’s a big difference between “having permission to see” and “being able to
effectively query and understand”. As policy attachment currently stands, you
couple of different ways that existing tooling isn't very good at.

**Q**: _Well then, in the real world, Jane would have access to higher-level
tools that know how to do that._

**A**: Those tools have yet to be written. Once they are, Jane and her team
will need to be taught that the tools exist and how to use them. From Jane’s
point of view, it's simpler not to need those tools: she'd rather just put the
right thing in her HTTPRoutes, and then be able to see them all when she reads
her HTTPRoutes.

**Q**: _What if we give Julian those tools? He could cope with them._

that.

**Q**: _Doesn't direct policy attachment make things better?_

**A**: Not really, no. Direct policy attachment is still spooky action at a
distance, so it doesn't make things markedly better.

(That said, direct policy attachment _does_ sidestep a specific very
unpleasant scenario that I considered but didn’t write about. In that one,
Julian tries to tweak the RetryPolicy to disable the retries for just the
`baker` workload, but runs afoul of an override installed by Jasmine from the
cluster-ops team, which Julian doesn’t have permission to even see... so he
has to infer the existence of the override he can't see, and he can't do
anything about it.)
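
For comparison, here's the same hypothetical policy using direct attachment,
targeting one specific HTTPRoute rather than the whole namespace (again, the
kind, group, and fields are invented for illustration):

```yaml
# Direct attachment: the hypothetical RetryPolicy targets a single
# HTTPRoute. The blast radius is smaller, but reading the HTTPRoute
# itself still tells you nothing about the policy.
apiVersion: policy.example.com/v1alpha1
kind: RetryPolicy
metadata:
  name: retry-baker-only
  namespace: baker
spec:
  targetRef:
    group: gateway.networking.k8s.io   # the real Gateway API group
    kind: HTTPRoute
    name: baker
  retries: 3
  backoff: 250ms
```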

**Q**: _OK, so isn’t this really just a retry thing? It’s not like all
policies can affect things so broadly._

**A**: To state the obvious, the whole point of policy attachment is to set
policy -- and by definition, policy has very broad capabilities. Retry is
actually a fairly _narrow_ function: suppose the attached policy was instead a
WAF which was intentionally applied on every namespace (gotta protect
everything!), and Jasmine mistakenly changed its configuration? That could
affect everything in the entire cluster – possibly only a week after Jasmine
made the change, when the WAF gets an update that interacts poorly with the
configuration change.

**Q**: _Dude, c’mon. That’s Jasmine and the WAF shooting themselves in the
foot, not a problem with policy attachment._

**A**: You’re right that policy attachment didn’t cause the retry issue we
looked at first, nor would it cause the WAF problem above. What we're
concerned about is that policy attachment _does_ make it much harder for Jane
to understand what's happening so that she can fix it. That will have a real
impact on real people.

**Q**: _So you're just saying that everything is impossible and you're not
listening to my questions._

**A**: Well, most of your "questions" aren't questions! 🙂

And we definitely think it's possible to do something about the situation;
that's what this proposal is all about.
