Commit b19f391: cleanups based on discussion
thockin committed Jul 6, 2020 (1 parent: 91d8810)
Showing 1 changed file with 58 additions and 90 deletions:
keps/sig-node/20181106-in-place-update-of-pod-resources.md

<<[UNRESOLVED]>>
Why is "allocated" spec? The need for special admission control is a strong
smell. Why doesn't this belong in status? If there's a compelling reason,
this KEP should say so.

Alternate proposal:
- Add Pod.Status.ContainerStatuses[i].Resources and .ResourcesAllocated (so
consumers can see changes)
- Store a local checkpoint holding the "allocated" values, to close the gap in
case of a status wipe or ill-timed reboot.
<<[/UNRESOLVED]>>
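
For concreteness, the alternate proposal might translate into something like
the following sketch (all type and field names here are illustrative, not part
of the KEP):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	ktypes "k8s.io/apimachinery/pkg/types"
)

// Sketch: surface both actual and allocated resources in ContainerStatus
// so consumers can observe resizes as they happen.
type containerStatusResources struct {
	Resources          v1.ResourceRequirements // values currently in effect
	ResourcesAllocated v1.ResourceList         // values Kubelet has admitted
}

// Sketch: a checkpoint persisted to the Kubelet's local disk so the
// allocated values survive a status wipe or an ill-timed reboot.
type allocatedCheckpoint struct {
	PodUID    ktypes.UID
	Allocated map[string]v1.ResourceList // keyed by container name
}
```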

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy - a list of named subobjects (new object) that supports 'cpu' and
'memory' as names. It supports the following policy values:
* RestartNotRequired - the default value; try to resize the Container without
restarting it, if possible.
* Restart - the container requires a restart to apply new resource values
(e.g. a Java process needs to change its Xmx flag).

By using ResizePolicy, a user can mark Containers as safe (or unsafe) for
in-place resource update. Kubelet uses it to determine the required action.

Note: `RestartNotRequired` does not guarantee that a container won't be
restarted. The runtime may choose to stop the container if it is unable to
apply the new resources without doing so.

The flag is set separately for CPU and memory because of an observation that
CPU can usually be added or removed without much problem, whereas changes to
available memory are more likely to require a restart.
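
To make the shape concrete, the API could look roughly like the following (a
sketch only; the field names `Name` and `Policy` and the exact serialization
are assumptions, subject to API review):

```go
package sketch

// ResizePolicyValue describes whether a container must be restarted to
// apply a resource resize.
type ResizePolicyValue string

const (
	// Resize in place if possible; the runtime may still restart the
	// container if it cannot apply the new resources otherwise.
	RestartNotRequired ResizePolicyValue = "RestartNotRequired"
	// The container must be restarted to apply new resource values.
	Restart ResizePolicyValue = "Restart"
)

// ResizePolicy is one of the named subobjects listed in
// PodSpec.Containers[i].ResizePolicy; Name is 'cpu' or 'memory'.
type ResizePolicy struct {
	Name   string            `json:"name"`
	Policy ResizePolicyValue `json:"policy"`
}
```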

<<[UNRESOLVED]>>
Should something like RuntimeClass be allowed to say "no in-place restarts are
allowed", so that users would get a synchronous error if they set this? Or is
that overkill?
<<[/UNRESOLVED]>>

If more than one resource type is updated at the same time and their policies
differ, any `Restart` policy takes precedence over `RestartNotRequired`.

If a container's RestartPolicy is `Never`, the ResizePolicy must be set to
`RestartNotRequired` to pass validation. That said, any in-place resize may
result in the container being stopped *and not restarted*, if the system
cannot perform the resize in place.

#### CRI Changes

For a newly created Pod, Spec.Containers[i].ResourcesAllocated must match
Spec.Containers[i].Resources.Requests. When Kubelet admits a new Pod, values in
Spec.Containers[i].ResourcesAllocated are used to determine if there is enough
room to admit the Pod. Kubelet does not set Pod's ResourcesAllocated when
admitting a new Pod.
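
A sketch of that admission-time accounting follows (helper names are
illustrative; `ResourcesAllocated` is the field this KEP proposes and does not
exist in `v1.Container` today):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// container mirrors the proposed spec fields. ResourcesAllocated is the
// new field this KEP adds; it is not part of the current API.
type container struct {
	Name               string
	Resources          v1.ResourceRequirements
	ResourcesAllocated v1.ResourceList
}

// canAdmit reports whether a new Pod fits on the Node, counting existing
// Pods at their allocated (not merely desired) sizes.
func canAdmit(allocatable v1.ResourceList, existing [][]container, newPod []container) bool {
	used := v1.ResourceList{}
	for _, pod := range existing {
		for _, c := range pod {
			accumulate(used, c.ResourcesAllocated)
		}
	}
	for _, c := range newPod {
		// For a newly created Pod, ResourcesAllocated must equal
		// Resources.Requests, so either can be summed here.
		accumulate(used, c.ResourcesAllocated)
	}
	return fitsWithin(used, allocatable)
}

// accumulate adds each quantity in src into dst.
func accumulate(dst, src v1.ResourceList) {
	for name, qty := range src {
		total := dst[name]
		total.Add(qty)
		dst[name] = total
	}
}

// fitsWithin reports whether every used quantity fits within allocatable.
func fitsWithin(used, allocatable v1.ResourceList) bool {
	for name, qty := range used {
		alloc, ok := allocatable[name]
		if !ok || qty.Cmp(alloc) > 0 {
			return false
		}
	}
	return true
}
```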

When a Pod resize is requested, Kubelet attempts to update the resources
allocated to the Pod and its Containers. Kubelet first checks if the new
desired resources can fit the Node's allocatable resources by computing the sum
of resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods
on the Node, except the Pod being resized. For the Pod being resized, it adds
the new desired resources (i.e. Spec.Containers[i].Resources.Requests) to the
sum.
* If the new desired resources fit, Kubelet accepts the resize, updates
Spec.Containers[i].ResourcesAllocated, and then invokes the
UpdateContainerResources CRI API to update Container resource limits. Once
all Containers are successfully updated, it updates
Pod.Status.ContainerStatuses[i].Resources to reflect new resource values.
* If the new desired resources don't fit, Kubelet does not act on the resize;
no further action is taken at this time, and Kubelet retries the resize at a
later time.
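
Putting those two bullets together, the decision logic might look like this
sketch (function and helper names are illustrative; `UpdateContainerResources`
is the existing CRI call, stubbed here as a function parameter; `container` is
the sketch type from the admission example above):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// updateFn stands in for the CRI UpdateContainerResources call.
type updateFn func(containerName string, resources v1.ResourceRequirements) error

// handlePodResize sketches the accept/defer decision for one Pod. fits is
// the result of the node-level sum described above; update applies the new
// values through the CRI.
func handlePodResize(pod []container, fits bool, update updateFn) (resized bool, err error) {
	if !fits {
		// Defer: leave ResourcesAllocated unchanged; Kubelet retries
		// the resize at a later time.
		return false, nil
	}
	for i := range pod {
		c := &pod[i]
		// Accept: record the newly allocated values...
		c.ResourcesAllocated = c.Resources.Requests
		// ...then apply the new limits through the CRI.
		if err := update(c.Name, c.Resources); err != nil {
			return false, err
		}
	}
	// Once all containers are updated, Kubelet updates
	// Pod.Status.ContainerStatuses[i].Resources to reflect the new values.
	return true, nil
}
```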

<<[UNRESOLVED]>>
As discussed in comments, "no signal" is a poor signal. I advocate for at
least this:

"""
Kubelet will generate Events on the Pod whenever a resize is acted upon or when
a resize is deferred. This will allow humans to know that progress is being
made.
"""

I also advocate for this; it can be deferred, though I bet we need it
eventually:

"""
When a resize is in progress, Kubelet will create a Condition on the pod,
called ResizeInProgress. When a resize has been deferred, Kubelet will create
a condition called ResizeDeferred.
"""
<<[/UNRESOLVED]>>
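
If the advocated signals were adopted, emitting them might look roughly like
this (a sketch; the event reason and condition name come from the quoted
suggestion and are not settled API):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"
)

// markResizeDeferred surfaces a deferred resize via an Event and a Pod
// Condition, so both humans and controllers get a real signal.
func markResizeDeferred(recorder record.EventRecorder, pod *v1.Pod) {
	recorder.Event(pod, v1.EventTypeNormal, "ResizeDeferred",
		"node lacks room for the requested resources; resize will be retried")
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		Type:               v1.PodConditionType("ResizeDeferred"),
		Status:             v1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	})
}
```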

If multiple Pods need resizing, they are handled sequentially in an order
defined by the Kubelet (e.g. in order of arrival).

The scheduler may, in parallel, assign a new Pod to the Node because it uses
cached Pods to compute the Node's allocatable values. If this race condition
occurs, Kubelet resolves it by rejecting the new Pod if the Node has no room
after the Pod resize.

Note: After a Pod is rejected, the scheduler could try to reschedule the
replacement pod on the same node that just rejected it. This is a general
statement about Kubernetes and is outside the scope of this KEP.

#### Kubelet Restart Tolerance

### Flow Control

The following steps denote a typical flow of an in-place resize operation for a
Pod with ResizePolicy set to RestartNotRequired for all its Containers.

1. Initiating actor updates Pod's Spec.Containers[i].Resources via PATCH verb
(see the sketch after this list).
1. API Server validates the new Resources. (e.g. Limits are not below
- Do nothing and let Kubelet back off and later retry the in-place resize.
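
As an illustration of step 1 above, the initiating actor might issue the PATCH
like this (a sketch using client-go; the pod name, namespace, and container
name are illustrative):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// resizeCPU patches one container's CPU request and limit in place.
func resizeCPU(ctx context.Context, client kubernetes.Interface) error {
	patch := []byte(`{"spec":{"containers":[{"name":"app",` +
		`"resources":{"requests":{"cpu":"1500m"},"limits":{"cpu":"2"}}}]}}`)
	_, err := client.CoreV1().Pods("default").Patch(ctx, "example-pod",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```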

<<[UNRESOLVED]>>
As above: I'm suggesting that there be a real signal indicating "I tried to resize, but I
can't" beyond the lack of update to `allocated`.
<<[/UNRESOLVED]>>


I think, instead, we need to say that resizes (especially but not exclusively
shrinks) can fail. A failed resize could cause container restarts even when
RestartNotRequired is specified.

Can a failed resize cause a full pod teardown (including a new IP assignment,
emptyDir volumes being wiped, etc)? I don't know. It seems consistent, to me,
### Resize Policy Tests

Setup a guaranteed class Pod with two containers (c1 & c2).
1. No resize policy specified, defaults to RestartNotRequired. Verify that CPU and
memory are resized without restarting containers.
1. RestartNotRequired (cpu, memory) policy for c1, Restart (cpu, memory) for c2.
Verify that c1 is resized without restart, c2 is restarted on resize.
1. RestartNotRequired cpu, Restart memory policy for c1. Resize c1 CPU only,
verify container is resized without restart.
1. RestartNotRequired cpu, Restart memory policy for c1. Resize c1 memory only,
verify container is resized with restart.
1. RestartNotRequired cpu, Restart memory policy for c1. Resize c1 CPU & memory,
verify container is resized with restart.

### Backward Compatibility and Negative Tests