Ask a bunch of questions on in-place pod resize
This is not really intended to be committed, but to get the questions
debated and answered
thockin committed Jul 1, 2020
1 parent 3b6b4eb commit 91d8810
Showing 1 changed file with 103 additions and 5 deletions: keps/sig-node/20181106-in-place-update-of-pod-resources.md
@@ -108,12 +108,12 @@ Containers addresses this issue directly.

### Goals

-* Primary: allow to change Pod resource requests & limits without restarting
-  its Containers.
+* Primary: allow to change container resource requests & limits without
+  necessarily restarting the container
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
  how to proceed if in-place resource resize is not possible.
-* Secondary: allow users to specify which Pods and Containers can be resized
-  without a restart.
+* Secondary: allow users to specify which Containers can be resized without a
+  restart.

### Non-Goals

@@ -153,6 +153,18 @@ Additionally, Kubelet is authorized to update PodSpec, and NodeRestriction
admission plugin is extended to limit Kubelet's update access only to Pod's
ResourcesAllocated field for CPU and memory resources.

<<[UNRESOLVED]>>
Why is "allocated" spec? The need for special admission control is a strong
smell. Why doesn't this belong in status? If there's a compelling reason,
this KEP should say so.
<<[/UNRESOLVED]>>

<<[UNRESOLVED]>>
Do we want to define a subresource for setting `resources` on pods? This would
allow us to grant RBAC access to things like VPA without allowing full write
access to pod spec.
<<[/UNRESOLVED]>>
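
As a rough sketch of what that could enable, assuming a hypothetical
`pods/resize` subresource (the name and verb here are illustrative, not defined
by this KEP), an RBAC rule could grant VPA patch access to just that
subresource rather than to the whole pod spec:

```yaml
# Illustrative only: "pods/resize" is a hypothetical subresource name.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vpa-resize-only
rules:
- apiGroups: [""]
  resources: ["pods/resize"]   # resize access without full write access to pods
  verbs: ["patch"]
```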

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
@@ -172,9 +184,24 @@ available memory are more probable to require restarts.
If more than one resource type with different policies are updated, then
RestartContainer policy takes precedence over NoRestart policy.

-Additionally, if RestartPolicy is 'Never', ResizePolicy should be set to
+Additionally, if RestartPolicy is 'Never', ResizePolicy must be set to
NoRestart in order to pass validation.
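
For illustration, a per-container policy might look like the sketch below. The
field shape is assumed from the text above (not a settled API) and exists only
to show the NoRestart / RestartContainer semantics:

```yaml
# Sketch only; field names are placeholders based on this KEP's description.
spec:
  restartPolicy: Always
  containers:
  - name: app
    resizePolicy:
    - resourceName: cpu
      policy: NoRestart          # CPU may be resized in place
    - resourceName: memory
      policy: RestartContainer   # memory changes restart this container
    resources:
      requests:
        cpu: "500m"
        memory: 256Mi
```

Under this sketch, a single update that changes both CPU and memory follows the
precedence rule above: RestartContainer wins, so the container restarts.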

<<[UNRESOLVED]>>
Should something like RuntimeClass be allowed to say "no in-place restarts are
allowed", so that users would get a synchronous error if they set this? Or is
that overkill?
<<[/UNRESOLVED]>>

<<[UNRESOLVED]>>
`resizePolicy: NoRestart` is not obvious. To me that reads as "you may not
restart this", but that's not what it means, right? Can we phrase it better?

Do we need a policy like "RestartNotAllowed" which says the opposite of
NoRestart - you MAY NOT restart me? That seems complicated and will leave
resizes in "limbo", so maybe not.
<<[/UNRESOLVED]>>

#### CRI Changes

Kubelet calls UpdateContainerResources CRI API which currently takes
@@ -200,6 +227,34 @@ Spec.Containers[i].ResourcesAllocated are used to determine if there is enough
room to admit the Pod. Kubelet does not set Pod's ResourcesAllocated after
admitting a new Pod.

<<[UNRESOLVED]>>
The above is confusing. Does it mean that ResourcesAllocated will be set by
someone other than Kubelet? Who? Why?

Above you said 'limit access to ResourcesAllocated field such that only Kubelet
can update this field' which is the opposite of 'kubelet does not set Pod's
ResourcesAllocated after admitting' here. I am very confused.

Naively, I would expect it to be:

Create:
* user creates a pod with `resources`
* scheduler schedules based on new pod `resources` and existing pods'
`allocated`
* kubelet accepts the pod based on new pod `resources` and existing pods'
`allocated`
* kubelet sets `allocated` = `resources` immediately (acts as an ACK)
* kubelet sets `actual` = `resources` once created

Update:
* user updates a pod's `resources`
* kubelet considers the new `resources` and other pods' `allocated`
* if viable, kubelet sets `allocated` = `resources` immediately (ACK) and sets
`actual` = `resources` once resized
* if not viable, kubelet does not change `allocated` and instead drops an event
and maybe a new status field or condition
<<[/UNRESOLVED]>>
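
To make the three values in that flow concrete, here is an illustration of how
they might appear on a pod mid-resize. `resources` / `allocated` / `actual` are
the shorthand used in this comment, mapped onto the fields the KEP mentions
(Spec.Containers[i].Resources, Spec.Containers[i].ResourcesAllocated,
Status.ContainerStatuses[i].Resources); none of this is final API:

```yaml
# Illustration only: a container asked to grow from 1 to 2 CPUs, ACKed by
# kubelet, but not yet resized by the runtime.
spec:
  containers:
  - name: app
    resources:               # "resources": desired state, written by the user
      requests: {cpu: "2"}
    resourcesAllocated:      # "allocated": kubelet's ACK of what it admitted
      cpu: "2"
status:
  containerStatuses:
  - name: app
    resources:               # "actual": what the runtime is currently applying
      requests: {cpu: "1"}
```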

When a Pod resize is requested, Kubelet attempts to update the resources
allocated to the Pod and its Containers. Kubelet first checks if the new
desired resources can fit the Node allocable resources by computing the sum of
@@ -215,13 +270,38 @@ new desired resources (i.e Spec.Containers[i].Resources.Requests) to the sum.
further action is taken.
- Kubelet retries the Pod resize at a later time.
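
A worked example of the fit check, with assumed numbers (purely illustrative,
not taken from the KEP):

```yaml
# Node allocatable vs. what a proposed resize would need:
nodeAllocatable:        {cpu: "4", memory: 8Gi}
otherPodsAllocatedSum:  {cpu: "3", memory: 6Gi}   # sum of ResourcesAllocated
resizedPodNewRequests:  {cpu: 1500m, memory: 1Gi} # Spec.Containers[i].Resources.Requests
# CPU: 3 + 1.5 = 4.5 > 4 allocatable -> the resize does not fit right now,
# so Kubelet defers it and retries later, per the text above.
```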

<<[UNRESOLVED]>>
This is close to what I wrote above, but still vague. What does it mean to
"reject" a resize? How do controllers know it has been deferred temporarily
(will retry) or permanently (no more retries)? Does this put the pod into a
new condition that higher-level controls need to know?
<<[/UNRESOLVED]>>

If multiple Pods need resizing, they are handled sequentially in the order in
which Pod additions and updates arrive at Kubelet.

<<[UNRESOLVED]>>
I'd call this over-specced. Kubelet should have freedom to choose higher-prio
pods or "easy" cases or whatever.
<<[/UNRESOLVED]>>

Scheduler may, in parallel, assign a new Pod to the Node because it uses cached
Pods to compute Node allocable values. If this race condition occurs, Kubelet
resolves it by rejecting that new Pod if the Node has no room after Pod resize.

<<[UNRESOLVED]>>
What prevents the scheduler from doing the same thing again and again? This is
not specific to this KEP, but we don't have a mechanism to prevent this. E.g.
we COULD do something like:

* scheduler proposes a pod to a node in a new "earmark" resource
* kubelet ACKs the earmark with an expiry timestamp
* scheduler binds the pod to the node

This is obviously more complicated, so we can consider it separately, but it
feels like a gap that will get worse.
<<[/UNRESOLVED]>>
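
Purely to make the shape of that idea concrete, a sketch of such a handshake
object (the kind, group, and every field below are invented for illustration
and are not proposed API):

```yaml
# Hypothetical "earmark" object; nothing like this exists today.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Earmark
metadata:
  name: web-0-on-node-a
spec:
  podUID: 9a1b2c3d-0000-4000-8000-000000000000   # pod the scheduler wants placed
  nodeName: node-a
  requests: {cpu: "2", memory: 4Gi}
status:
  acknowledged: true                   # set by kubelet to reserve the room
  expiresAt: "2020-07-01T12:05:00Z"    # reservation lapses if binding never happens
```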

#### Kubelet Restart Tolerance

If Kubelet were to restart amidst handling a Pod resize, then upon restart, all
@@ -287,6 +367,11 @@ Pod with ResizePolicy set to NoRestart for all its Containers.
- Evict the Pod to trigger a replacement Pod with new desired resources,
- Do nothing and let Kubelet back off and later retry the in-place resize.

<<[UNRESOLVED]>>
I'm suggesting that there be a real signal indicating "I tried to resize, but I
can't" beyond the lack of update to `allocated`.
<<[/UNRESOLVED]>>
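
One possible shape for such a signal is a pod condition; the type, reason, and
message below are placeholders, not proposed API:

```yaml
# Hypothetical condition a controller (VPA, StatefulSet, etc.) could watch for.
status:
  conditions:
  - type: ResizeInfeasible
    status: "True"
    reason: NodeOutOfCPU
    message: "requested 4 CPU for in-place resize; node has 2 CPU free"
```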

#### Container resource limit update ordering

When in-place resize is requested for multiple Containers in a Pod, Kubelet
@@ -332,6 +417,19 @@ Status.ContainerStatuses[i].Resources to match the desired limit values.
determine if in-place resize is possible.
* Impact of memory-backed emptyDir volumes: If memory-backed emptyDir is in
use, Kubelet will clear out any files in emptyDir upon Container restart.
<<[UNRESOLVED]>>
This is not true today, IIRC. tmpfs mounts are not cleared unless the whole
pod is torn down. Is this net-new behavior? How can that be safe, when
another container might be using those files?

I think, instead, we need to say that resizes (especially but not exclusively
shrinks) can fail. A failed resize could cause container restarts even when
NoRestart is specified.

Can a failed resize cause a full pod teardown (including a new IP assignment,
emptyDir volumes being wiped, etc.)? I don't know. It seems consistent to me,
but we don't have a clean spec for what that really means.
<<[/UNRESOLVED]>>
* At this time, Vertical Pod Autoscaler should not be used with Horizontal Pod
Autoscaler on CPU, memory. This enhancement does not change that limitation.

