Ask a bunch of questions on in-place pod resize
This is not really intended to be committed, but to get the questions
debated and answered
thockin committed Jul 1, 2020
1 parent 3b6b4eb commit 91d8810
Showing 1 changed file with 103 additions and 5 deletions: keps/sig-node/20181106-in-place-update-of-pod-resources.md
@@ -108,12 +108,12 @@ Containers addresses this issue directly.

### Goals

-* Primary: allow to change Pod resource requests & limits without restarting
-  its Containers.
+* Primary: allow to change container resource requests & limits without
+  necessarily restarting the container
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
  how to proceed if in-place resource resize is not possible.
-* Secondary: allow users to specify which Pods and Containers can be resized
-  without a restart.
+* Secondary: allow users to specify which Containers can be resized without a
+  restart.

### Non-Goals

@@ -153,6 +153,18 @@ Additionally, Kubelet is authorized to update PodSpec, and NodeRestriction
admission plugin is extended to limit Kubelet's update access only to Pod's
ResourcesAllocated field for CPU and memory resources.

<<[UNRESOLVED]>>
Why is "allocated" spec? The need for special admission control is a strong
smell. Why doesn't this belong in status? If there's a compelling reason,
this KEP should say so.
<<[/UNRESOLVED]>>

<<[UNRESOLVED]>>
Do we want to define a subresource for setting `resources` on pods? This would
allow us to grant RBAC access to things like VPA without allowing full write
access to pod spec.
<<[/UNRESOLVED]>>
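
As a rough sketch of what that could enable, assuming a hypothetical
`pods/resize` subresource (the name and verb here are illustrative, not defined
by this KEP), an RBAC rule could grant VPA patch access to just that
subresource rather than to the whole pod spec:

```yaml
# Illustrative only: "pods/resize" is a hypothetical subresource name.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vpa-resize-only
rules:
- apiGroups: [""]
  resources: ["pods/resize"]   # resize access without full write access to pods
  verbs: ["patch"]
```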

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
@@ -172,9 +184,24 @@ available memory are more probable to require restarts.
If more than one resource type with different policies are updated, then
RestartContainer policy takes precedence over NoRestart policy.

-Additionally, if RestartPolicy is 'Never', ResizePolicy should be set to
+Additionally, if RestartPolicy is 'Never', ResizePolicy must be set to
NoRestart in order to pass validation.
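
For illustration, a per-container policy might look like the sketch below. The
field shape is assumed from the text above (not a settled API) and exists only
to show the NoRestart / RestartContainer semantics:

```yaml
# Sketch only; field names are placeholders based on this KEP's description.
spec:
  restartPolicy: Always
  containers:
  - name: app
    resizePolicy:
    - resourceName: cpu
      policy: NoRestart          # CPU may be resized in place
    - resourceName: memory
      policy: RestartContainer   # memory changes restart this container
    resources:
      requests:
        cpu: "500m"
        memory: 256Mi
```

Under this sketch, a single update that changes both CPU and memory follows the
precedence rule above: RestartContainer wins, so the container restarts.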

<<[UNRESOLVED]>>
Should something like RuntimeClass be allowed to say "no in-place restarts are
allowed", so that users would get a synchronous error if they set this? Or is
that overkill?
<<[/UNRESOLVED]>>

<<[UNRESOLVED]>>
`resizePolicy: NoRestart` is not obvious. To me that reads as "you may not
restart this", but that's not what it means, right? Can we phrase it better?

Do we need a policy like "RestartNotAllowed" which says the opposite of
NoRestart - you MAY NOT restart me? That seems complicated and will leave
resizes in "limbo", so maybe not.
<<[/UNRESOLVED]>>

#### CRI Changes

Kubelet calls UpdateContainerResources CRI API which currently takes
@@ -200,6 +227,34 @@ Spec.Containers[i].ResourcesAllocated are used to determine if there is enough
room to admit the Pod. Kubelet does not set Pod's ResourcesAllocated after
admitting a new Pod.

<<[UNRESOLVED]>>
The above is confusing. Does it mean that ResourcesAllocated will be set by
someone other than Kubelet? Who? Why?

Above you said 'limit access to ResourcesAllocated field such that only Kubelet
can update this field' which is the opposite of 'kubelet does not set Pod's
ResourcesAllocated after admitting' here. I am very confused.

Naively, I would expect it to be:

Create:
* user creates a pod with `resources`
* scheduler schedules based on new pod `resources` and existing pods'
`allocated`
* kubelet accepts the pod based on new pod `resources` and existing pods'
`allocated`
* kubelet sets `allocated` = `resources` immediately (acts as an ACK)
* kubelet sets `actual` = `resources` once created

Update:
* user updates a pod's `resources`
* kubelet considers the new `resources` and other pods' `allocated`
* if viable, kubelet sets `allocated` = `resources` immediately (ACK) and sets
`actual` = `resources` once resized
* if not viable, kubelet does not change `allocated` and instead drops an event
and maybe a new status field or condition
<<[/UNRESOLVED]>>
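
To make the three values in that flow concrete, here is an illustration of how
they might appear on a pod mid-resize. `resources` / `allocated` / `actual` are
the shorthand used in this comment, mapped onto the fields the KEP mentions
(Spec.Containers[i].Resources, Spec.Containers[i].ResourcesAllocated,
Status.ContainerStatuses[i].Resources); none of this is final API:

```yaml
# Illustration only: a container asked to grow from 1 to 2 CPUs, ACKed by
# kubelet, but not yet resized by the runtime.
spec:
  containers:
  - name: app
    resources:               # "resources": desired state, written by the user
      requests: {cpu: "2"}
    resourcesAllocated:      # "allocated": kubelet's ACK of what it admitted
      cpu: "2"
status:
  containerStatuses:
  - name: app
    resources:               # "actual": what the runtime is currently applying
      requests: {cpu: "1"}
```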

When a Pod resize is requested, Kubelet attempts to update the resources
allocated to the Pod and its Containers. Kubelet first checks if the new
desired resources can fit the Node allocable resources by computing the sum of
@@ -215,13 +270,38 @@ new desired resources (i.e Spec.Containers[i].Resources.Requests) to the sum.
further action is taken.
- Kubelet retries the Pod resize at a later time.
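
A worked example of the fit check, with assumed numbers (purely illustrative,
not taken from the KEP):

```yaml
# Node allocatable vs. what a proposed resize would need:
nodeAllocatable:        {cpu: "4", memory: 8Gi}
otherPodsAllocatedSum:  {cpu: "3", memory: 6Gi}   # sum of ResourcesAllocated
resizedPodNewRequests:  {cpu: 1500m, memory: 1Gi} # Spec.Containers[i].Resources.Requests
# CPU: 3 + 1.5 = 4.5 > 4 allocatable -> the resize does not fit right now,
# so Kubelet defers it and retries later, per the text above.
```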

<<[UNRESOLVED]>>
This is close to what I wrote above, but still vague. What does it mean to
"reject" a resize? How do controllers know it has been deferred temporarily
(will retry) or permanently (no more retries)? Does this put the pod into a
new condition that higher-level controls need to know?
<<[/UNRESOLVED]>>

If multiple Pods need resizing, they are handled sequentially in the order in
which Pod additions and updates arrive at Kubelet.

<<[UNRESOLVED]>>
I'd call this over-specced. Kubelet should have freedom to choose higher-prio
pods or "easy" cases or whatever.
<<[/UNRESOLVED]>>

Scheduler may, in parallel, assign a new Pod to the Node because it uses cached
Pods to compute Node allocable values. If this race condition occurs, Kubelet
resolves it by rejecting that new Pod if the Node has no room after Pod resize.

<<[UNRESOLVED]>>
What prevents the scheduler from doing the same thing again and again? This is
not specific to this KEP, but we don't have a mechanism to prevent this. E.g.
we COULD do something like:

* scheduler proposes a pod to a node in a new "earmark" resource
* kubelet ACKs the earmark with an expiry timestamp
* scheduler binds the pod to the node

This is obviously more complicated, so we can consider it separately, but it
feels like a gap that will get worse.
<<[/UNRESOLVED]>>
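
Purely to make the shape of that idea concrete, a sketch of such a handshake
object (the kind, group, and every field below are invented for illustration
and are not proposed API):

```yaml
# Hypothetical "earmark" object; nothing like this exists today.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Earmark
metadata:
  name: web-0-on-node-a
spec:
  podUID: 9a1b2c3d-0000-4000-8000-000000000000   # pod the scheduler wants placed
  nodeName: node-a
  requests: {cpu: "2", memory: 4Gi}
status:
  acknowledged: true                   # set by kubelet to reserve the room
  expiresAt: "2020-07-01T12:05:00Z"    # reservation lapses if binding never happens
```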

#### Kubelet Restart Tolerance

If Kubelet were to restart amidst handling a Pod resize, then upon restart, all
@@ -287,6 +367,11 @@ Pod with ResizePolicy set to NoRestart for all its Containers.
- Evict the Pod to trigger a replacement Pod with new desired resources,
- Do nothing and let Kubelet back off and later retry the in-place resize.

<<[UNRESOLVED]>>
I'm suggesting that there be a real signal indicating "I tried to resize, but I
can't" beyond the lack of update to `allocated`.
<<[/UNRESOLVED]>>
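
One possible shape for such a signal is a pod condition; the type, reason, and
message below are placeholders, not proposed API:

```yaml
# Hypothetical condition a controller (VPA, StatefulSet, etc.) could watch for.
status:
  conditions:
  - type: ResizeInfeasible
    status: "True"
    reason: NodeOutOfCPU
    message: "requested 4 CPU for in-place resize; node has 2 CPU free"
```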

#### Container resource limit update ordering

When in-place resize is requested for multiple Containers in a Pod, Kubelet
@@ -332,6 +417,19 @@ Status.ContainerStatuses[i].Resources to match the desired limit values.
determine if in-place resize is possible.
* Impact of memory-backed emptyDir volumes: If memory-backed emptyDir is in
use, Kubelet will clear out any files in emptyDir upon Container restart.
<<[UNRESOLVED]>>
This is not true today, IIRC. tmpfs mounts are not cleared unless the whole
pod is torn down. Is this net-new behavior? How can that be safe, when
another container might be using those files?

I think, instead, we need to say that resizes (especially but not exclusively
shrinks) can fail. A failed resize could cause container restarts even when
NoRestart is specified.

Can a failed resize cause a full pod teardown (including a new IP assignment,
emptyDir volumes being wiped, etc.)? I don't know. It seems consistent to me,
but we don't have a clean spec for what that really means.
<<[/UNRESOLVED]>>
* At this time, Vertical Pod Autoscaler should not be used with Horizontal Pod
Autoscaler on CPU, memory. This enhancement does not change that limitation.

