Fixes the races around devicemanager Allocate() and endpoint deletion. #60856

Merged
merged 1 commit into kubernetes:master Mar 12, 2018

Conversation

@jiayingz
Member

jiayingz commented Mar 6, 2018

There is a race in predicateAdmitHandler Admit() where getNodeAnyWayFunc()
could get a Node with non-zero device plugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails, but is
more likely when kubelet restarts: with the current registration model, there
is a time gap between kubelet restart and device plugin re-registration.
During this time window, even though devicemanager could have removed the
resource initially during the GetCapacity() call, kubelet may overwrite the
device plugin resource capacity/allocatable with the old value when a node
update from the API server comes in later. This could cause a pod to be
started without the proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint; instead, set stopTime on
the endpoint. During kubelet restart, create endpoints with stopTime set for
any checkpointed registered resource. An endpoint is considered to be in
stopGracePeriod if its stopTime is set. This allows us to track which
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and its
resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through the API server) if there is
such a use case. Currently endpointStopGracePeriod is set to 5 minutes.
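
A minimal sketch of the stopTime bookkeeping described above, assuming hypothetical names (endpointImpl, setStopTime, isStopped, stopGracePeriodExpired); not necessarily the PR's code verbatim:

package devicemanager

import (
	"sync"
	"time"
)

const endpointStopGracePeriod = time.Duration(5) * time.Minute

type endpointImpl struct {
	mutex    sync.Mutex
	stopTime time.Time // zero value means the endpoint is still running
}

// setStopTime records when the endpoint stopped instead of deleting it right away.
func (e *endpointImpl) setStopTime(t time.Time) {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	e.stopTime = t
}

// isStopped reports whether the endpoint is in its stop grace period.
func (e *endpointImpl) isStopped() bool {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	return !e.stopTime.IsZero()
}

// stopGracePeriodExpired reports whether the endpoint and its resource may now
// be removed from the manager's cache.
func (e *endpointImpl) stopGracePeriodExpired() bool {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	return !e.stopTime.IsZero() && time.Since(e.stopTime) > endpointStopGracePeriod
}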

Given that an endpoint is no longer immediately removed upon disconnection,
mark all of its devices unhealthy so that we can signal the resource
allocatable change to the scheduler and avoid scheduling more pods onto the
node. When a device plugin endpoint is in stopGracePeriod, pods requesting
the corresponding resource will fail the admission handler.
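
Continuing the sketch above (same hypothetical package and endpointImpl type), the unhealthy-marking and admission-time check could look roughly like this; the managerSketch struct and its fields are assumptions:

package devicemanager

import "fmt"

type managerSketch struct {
	endpoints        map[string]*endpointImpl // resource name -> endpoint
	healthyDevices   map[string]map[string]bool
	unhealthyDevices map[string]map[string]bool
}

// markResourceUnhealthy moves every device of a stopped endpoint's resource to
// the unhealthy set, so the allocatable count reported to the scheduler drops.
func (m *managerSketch) markResourceUnhealthy(resource string) {
	if m.unhealthyDevices == nil {
		m.unhealthyDevices = map[string]map[string]bool{}
	}
	if m.unhealthyDevices[resource] == nil {
		m.unhealthyDevices[resource] = map[string]bool{}
	}
	for id := range m.healthyDevices[resource] {
		m.unhealthyDevices[resource][id] = true
	}
	m.healthyDevices[resource] = map[string]bool{}
}

// admitResource fails admission for a resource whose endpoint is unknown or
// still inside endpointStopGracePeriod.
func (m *managerSketch) admitResource(resource string) error {
	e, ok := m.endpoints[resource]
	if !ok {
		return fmt.Errorf("unknown device plugin resource %q", resource)
	}
	if e.isStopped() {
		return fmt.Errorf("endpoint for %q is in stop grace period, failing admission", resource)
	}
	return nil
}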

Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60176

Special notes for your reviewer:

Release note:

Fixes the races around devicemanager Allocate() and endpoint deletion.
@jiayingz

Member

jiayingz commented Mar 6, 2018

@RenaudWasTaken

Member

RenaudWasTaken commented Mar 6, 2018

/area hw-accelerators

@RenaudWasTaken

Wouldn't it be simpler to issue a callback on endpoint stop and add that timeout at the manager level?

That would remove this weird intermediate "dead but not removed" state in the endpoint and simplify the guarding code you added to all the RPC calls.

You also wouldn't need to check for devices not in sync with the endpoint since you'd always have the stopGracePeriod.

And as far as re-registration is concerned, with the stop callback you would just need to replace the callback func with a noop func before stopping the endpoint.
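
A rough sketch of this alternative, assuming hypothetical types (callbackEndpoint, callbackManager) and a manager-level grace timer; not the design the PR merged:

package devicemanager

import (
	"sync"
	"time"
)

type callbackEndpoint struct {
	mu     sync.Mutex
	onStop func(resourceName string)
}

// setStopCallback swaps the function invoked when the endpoint stops.
func (e *callbackEndpoint) setStopCallback(cb func(string)) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.onStop = cb
}

// stop invokes whatever stop callback is currently installed.
func (e *callbackEndpoint) stop(resourceName string) {
	e.mu.Lock()
	cb := e.onStop
	e.mu.Unlock()
	if cb != nil {
		cb(resourceName)
	}
}

type callbackManager struct {
	mu          sync.Mutex
	graceTimers map[string]*time.Timer
}

// endpointStopped arms a manager-level grace timer; when it fires, the
// resource is dropped, so the endpoint itself never carries a
// "dead but not removed" state.
func (m *callbackManager) endpointStopped(resourceName string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.graceTimers == nil {
		m.graceTimers = map[string]*time.Timer{}
	}
	m.graceTimers[resourceName] = time.AfterFunc(5*time.Minute, func() {
		// drop capacity/allocatable for resourceName here
	})
}

// reRegister installs a no-op callback before stopping the old endpoint, so
// re-registration never starts a grace timer, and cancels any pending one.
func (m *callbackManager) reRegister(old *callbackEndpoint, resourceName string) {
	old.setStopCallback(func(string) {})
	old.stop(resourceName)
	m.mu.Lock()
	defer m.mu.Unlock()
	if t := m.graceTimers[resourceName]; t != nil {
		t.Stop()
		delete(m.graceTimers, resourceName)
	}
}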

// endpointStopGracePeriod indicates the grace period after an endpoint is stopped
// because its device plugin fails. DeviceManager keeps the stopped endpoint in its
// cache during this grace period to cover the time gap for the capacity change to
// take effect.
const endpointStopGracePeriod = time.Duration(5) * time.Minute

@RenaudWasTaken

RenaudWasTaken Mar 6, 2018

Member

maybe move this to types.go, this sounds like an important constant :)

@dims

Member

dims commented Mar 6, 2018

/sig node

@jiayingz

Member

jiayingz commented Mar 6, 2018

@RenaudWasTaken

Member

RenaudWasTaken commented Mar 7, 2018

Are you suggesting adding another map in the manager to track the endpoint stop grace period? I did consider that approach at the beginning but felt we would need to manage the lifecycle of another data structure and make sure it is consistent with endpoint updates.

We probably need to move the devices into their own structure anyway; as you mentioned, managing the lifecycle of these structures in the manager is complex and needs to be carefully checked.

What do you think of something like:

type ManagerStore interface {
     // This sets the timer when called as Update(rName, [], [], allTheDevices)
     Update(rName string, added, updated, deleted []pluginapi.Device)

     // This decides whether a resource may be removed, based on the time since the last Update
     GetCapacity() (capacity, allocatable v1.ResourceList, deleted []string)
}
@jiayingz

Member

jiayingz commented Mar 7, 2018

@jiayingz

Member

jiayingz commented Mar 7, 2018

/test pull-kubernetes-kubemark-e2e-gce

1 similar comment
@jiayingz

Member

jiayingz commented Mar 7, 2018

/test pull-kubernetes-kubemark-e2e-gce

@jiayingz

Member

jiayingz commented Mar 7, 2018

/assign @vishh

// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
	if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {

@vishh

vishh Mar 9, 2018

Member

nit: why change this line? It is idiomatic Go style.

// TODO: Reuse devices between init containers and regular containers.
for _, container := range pod.Spec.InitContainers {
	if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
	allocatedDevices, err := m.allocateContainerResources(pod, &container, devicesToReuse)

@vishh

vishh Mar 9, 2018

Member

Doesn't the new API support allocating multiple devices at once?

@@ -259,18 +263,39 @@ func (m *ManagerImpl) Devices() map[string][]pluginapi.Device {
func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {

@vishh

vishh Mar 9, 2018

Member

Can this logic be re-structured as follows to improve readability?

  1. Figure out max(# of devices requested by init containers)
  2. Figure out sum(# of devices requested by regular containers)
  3. Compute devices required = max(steps 1 & 2) (see the sketch below)
  4. Allocate these devices in some order (either individually or in batch)
  5. Assign devices to init containers (can intersect) and regular containers (mutually independent).

As of now the logic is dense. Since your PR is touching this logic, I'd like for it to be cleaned up.
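
A small sketch of steps 1-3, assuming a hypothetical requestedDevices helper; device plugin resources are extended resources, so limits equal requests:

package devicemanager

import (
	v1 "k8s.io/api/core/v1"
)

// requestedDevices returns how many units of one extended resource a
// container requests (0 if it does not request that resource).
func requestedDevices(c *v1.Container, resourceName v1.ResourceName) int64 {
	q, ok := c.Resources.Limits[resourceName]
	if !ok {
		return 0
	}
	return q.Value()
}

// devicesRequired computes max(largest init-container request, sum over
// regular containers): init containers run one at a time and can reuse
// devices, while regular containers run concurrently and cannot.
func devicesRequired(pod *v1.Pod, resourceName v1.ResourceName) int64 {
	var initMax, regularSum int64
	for i := range pod.Spec.InitContainers {
		if n := requestedDevices(&pod.Spec.InitContainers[i], resourceName); n > initMax {
			initMax = n
		}
	}
	for i := range pod.Spec.Containers {
		regularSum += requestedDevices(&pod.Spec.Containers[i], resourceName)
	}
	if initMax > regularSum {
		return initMax
	}
	return regularSum
}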

@vishh

Member

vishh commented Mar 9, 2018

Given that kubelet processes incoming pods serially during admission, this change can potentially block non-device-plugin pods from starting up, right?
Instead, should we consider a model where pods that use only first-class resources are not impacted by device plugins?
Imagine a cluster admin trying to run a container on a specific node to exec into and debug the host, and the admin's debug pod not starting on that node because of this change.

@k8s-ci-robot k8s-ci-robot removed the approved label Mar 9, 2018

@jiayingz

Member

jiayingz commented Mar 9, 2018

Per the offline discussion with @vishh: given that pod admission is run serially by a single process, and there is a chance that the device plugin pod may even be queued behind a pod that requests the device plugin resource, it seems better for now to just fail pod admission if the pod requests a disconnected device plugin resource. In 1.11, we should probably move the Allocate gRPC call outside of pod admission so that we can have a certain retry grace period. I modified the PR to only include the endpoint stopGracePeriod part so that devicemanager can properly fail pod admission during that time window. I also modified the device plugin e2e_node test to make sure we don't create pods too early after kubelet restart, so that they won't fail admission. PTAL.

@vishh

Member

vishh commented Mar 9, 2018

/lgtm
/approve

@fejta-bot

fejta-bot commented Mar 10, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

Fixes the races around devicemanager Allocate() and endpoint deletion.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 10, 2018

@vishh vishh added this to the v1.10 milestone Mar 10, 2018

@dims

Member

dims commented Mar 10, 2018

/test pull-kubernetes-e2e-gce

@vikaschoudhary16

Member

vikaschoudhary16 commented Mar 12, 2018

/lgtm

@k8s-ci-robot

Contributor

k8s-ci-robot commented Mar 12, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, vikaschoudhary16, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented Mar 12, 2018

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@jiayingz @vikaschoudhary16 @vishh

Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.10 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.

@k8s-merge-robot

Contributor

k8s-merge-robot commented Mar 12, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

Contributor

k8s-merge-robot commented Mar 12, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit a3f40dd into kubernetes:master Mar 12, 2018

12 of 14 checks passed

Submit Queue Required Github CI test is not green: pull-kubernetes-kubemark-e2e-gce
pull-kubernetes-e2e-gce-device-plugin-gpu Job triggered.
cla/linuxfoundation jiayingz authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.

k8s-merge-robot added a commit that referenced this pull request Mar 22, 2018

Merge pull request #61060 from jiayingz/automated-cherry-pick-of-#60856-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60856

Cherry pick of #60856 on release-1.9.

#60856: Fixes the races around devicemanager Allocate() and endpoint