
Fix bug in TopologyManager hint generation after kubelet restart #84525

Conversation

@klueska (Contributor) commented Oct 29, 2019

What type of PR is this?

/kind bug
What this PR does / why we need it:
Previously, the HintProviders for the CPUManager and devicemanager would attempt to generate (new) hints for already-running containers after a kubelet restart.

This patch adds logic to both the CPUManager and the devicemanager to regenerate hints for containers that already have CPUs and/or devices allocated to them, basing those hints on the existing allocations rather than computing fresh ones.
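For readers following along, here is a minimal sketch of the pattern this patch introduces, under simplified assumptions: the hint, provider, and allocatedNUMA names below are illustrative stand-ins, not the actual kubelet or topologymanager types.

// Hypothetical sketch of "regenerate hints from existing allocations"; the
// types here are simplified stand-ins, not the real kubelet APIs.
package main

import "fmt"

// hint stands in for topologymanager.TopologyHint: the NUMA nodes on which a
// request can be satisfied, plus whether that placement is preferred.
type hint struct {
	numaNodes []int
	preferred bool
}

// provider stands in for a HintProvider (CPUManager/devicemanager) that has
// restored its per-container allocations from checkpointed state after a restart.
type provider struct {
	allocatedNUMA map[string][]int // containerName -> NUMA nodes already in use
}

// getTopologyHints returns a single preferred hint derived from an existing
// allocation instead of computing fresh hints for an already-running container.
func (p *provider) getTopologyHints(containerName string) []hint {
	if nodes, ok := p.allocatedNUMA[containerName]; ok {
		return []hint{{numaNodes: nodes, preferred: true}}
	}
	// No existing allocation: fall back to computing hints from the free pool
	// (elided in this sketch).
	return nil
}

func main() {
	p := &provider{allocatedNUMA: map[string][]int{"my-container": {0}}}
	fmt.Println(p.getTopologyHints("my-container")) // [{[0] true}]
}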

Which issue(s) this PR fixes:

Fixes #84479

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot added the release-note-none, kind/bug, size/L, needs-sig, and needs-priority labels on Oct 29, 2019
@klueska (author) commented Oct 29, 2019

/assign @ConnorDoyle

@k8s-ci-robot added the cncf-cla: yes, area/kubelet, and sig/node labels and removed the needs-sig label on Oct 29, 2019
@k8s-ci-robot commented:

@klueska: GitHub didn't allow me to request PR reviews from the following users: cbf123.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @cbf123

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@klueska commented Oct 29, 2019

/cc @lmdaly @nolancon

@cbf123 (Contributor) left a comment:

Looks good to me, appreciate the fix.

@klueska commented Oct 29, 2019

@cbf123 would you mind pulling down this branch and giving it a spin to make sure it addresses your issue in a live setting?

@cbf123 commented Oct 29, 2019

@klueska Yep, I should be able to try it out.

@klueska commented Oct 30, 2019

@cbf123 after sleeping on this solution, I think there may still be a problem.

It's possible that a kubelet goes down after a container has been granted a set of devices but before it has been granted a set of CPUs (or vice versa). When this happens, we want the hints generated for the container after the kubelet restart to be the same as they were before the restart, so that any unallocated resources (either a set of devices or CPUs in this case) can have their hints combined properly by the TopologyManager. Simply returning nil implies "don't care" and can break alignment in this edge case.

I will rework this patch sometime today and let you know once it's done.
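To make the concern above concrete, here is a deliberately simplified sketch of hint merging; the hint type and merge function are hypothetical illustrations, not the real TopologyManager merge logic.

// Simplified illustration of why a nil ("don't care") hint from a restarted
// provider can break alignment; not the actual TopologyManager implementation.
package main

import "fmt"

// hint stands in for a single-NUMA-node topology hint.
type hint []int

// merge keeps the CPU candidates compatible with at least one device hint; a
// nil device hint list means "any placement is fine".
func merge(cpuHints, deviceHints []hint) []hint {
	if deviceHints == nil {
		return cpuHints
	}
	var out []hint
	for _, c := range cpuHints {
		for _, d := range deviceHints {
			if c[0] == d[0] { // single-node hints only, for brevity
				out = append(out, c)
			}
		}
	}
	return out
}

func main() {
	cpuCandidates := []hint{{0}, {1}}

	// Devices were already allocated on NUMA node 0 before the restart. If the
	// device provider regenerates its hint from that allocation, the only
	// merged candidate is node 0, so CPUs land next to the devices.
	fmt.Println(merge(cpuCandidates, []hint{{0}})) // [[0]]

	// If instead the device provider returns nil ("don't care"), node 1 also
	// survives the merge and CPUs may be placed away from the devices.
	fmt.Println(merge(cpuCandidates, nil)) // [[0] [1]]
}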

@klueska force-pushed the upstream-fix-hint-generation-after-kubelet-restart branch from 2e04e11 to 583ca94 on October 30, 2019 16:55
@klueska commented Oct 30, 2019

@cbf123 @lmdaly

OK. Update pushed. It has a dependency on the commit in #81344 which is also included as part of this PR.

@cbf123 commented Oct 30, 2019

Basic initial testing shows the previous version of the fix does seem to resolve the original problem. I'll give this one a try too.

@klueska commented Oct 30, 2019

/retest

@klueska commented Nov 4, 2019

If you're satisfied with this patch, at least for now, can you give it an lgtm so other reviewers know the state of things?

@cbf123 (Contributor) left a comment:

LGTM

@klueska commented Nov 4, 2019

I guess the best we could do is error out at container creation if the CPUs being allocated don't actually align with what was expected from the Hint. At least then you error out without allocating CPUs that don't have the alignment you expected.

We can't actually even do that without intertwining the topologymanager policy with the allocation policy (since for the best-effort policy, for example, we want to continue the allocation even if the NUMA affinity isn't satisfied).
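A rough sketch of the alignment check being described above, under assumed names (policy, cpuToNUMA, checkAlignment); this is illustrative only and not how the kubelet actually ties the TopologyManager policy into CPU allocation.

// Hypothetical sketch: fail an allocation when the CPUs actually handed out
// don't match the NUMA affinity promised by the hint, except under a tolerant
// (best-effort-like) policy. Not the real kubelet logic.
package main

import (
	"errors"
	"fmt"
)

type policy string

const (
	policyRestricted policy = "restricted"  // reject misaligned allocations
	policyBestEffort policy = "best-effort" // keep going even if misaligned
)

// checkAlignment returns an error if any allocated CPU lives on a NUMA node
// outside the hinted set and the policy does not tolerate misalignment.
func checkAlignment(p policy, hintedNUMA map[int]bool, cpuToNUMA map[int]int, allocated []int) error {
	for _, cpu := range allocated {
		if !hintedNUMA[cpuToNUMA[cpu]] {
			if p == policyBestEffort {
				return nil // misaligned, but the policy says continue anyway
			}
			return errors.New("allocated CPUs do not align with the topology hint")
		}
	}
	return nil
}

func main() {
	cpuToNUMA := map[int]int{0: 0, 1: 0, 2: 1, 3: 1}
	hinted := map[int]bool{0: true} // hint said: place everything on NUMA node 0
	fmt.Println(checkAlignment(policyRestricted, hinted, cpuToNUMA, []int{0, 2})) // error
	fmt.Println(checkAlignment(policyBestEffort, hinted, cpuToNUMA, []int{0, 2})) // <nil>
}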

This will become especially important as we move to a model where exclusive CPUs are assigned at pod admission time rather than at pod creation time.

Having this function will allow us to do garbage collection on these CPUs anytime we are about to allocate CPUs to a new set of containers, in addition to reclaiming state periodically in the reconcileState() loop. This ensures that we have the most up-to-date state when generating topology hints for a container. Without this, it's possible that some resources will be seen as allocated when they are actually free.
@klueska force-pushed the upstream-fix-hint-generation-after-kubelet-restart branch from b63fbcb to 585adc8 on November 5, 2019 14:29
@k8s-ci-robot removed the lgtm label on Nov 5, 2019
This patch also includes tests to make sure the newly added logic works as expected.
@klueska commented Nov 5, 2019

To reiterate, I'm looking for a review of the points in the following comment from someone more knowledgeable in this area: #84525 (comment)

/cc @derekwaynecarr
/cc @ConnorDoyle

@ConnorDoyle (Contributor) left a comment:

These changes make sense based on the PR description of the fault. Thanks @klueska for again leaving great comments to follow and @cbf123 for adding your review.

/approve

	// Strip all devices in use from the list of healthy ones.
	return m.healthyDevices[resource].Difference(m.allocatedDevices[resource])
}

func (m *ManagerImpl) getAllocatedDevices(podUID, containerName, resource string) sets.String {
	// Pull the list of device IDs directly from the list of podDevices.
	return m.podDevices[podUID][containerName][resource].deviceIds
A reviewer (Contributor) commented:
It's not expected for the 2nd and 3rd level mappings to be nil, but it would be nicer to check (and log and return an empty set I suppose)

@klueska (author) replied:
One better -- there was already a function for this in the podDevices abstraction:

m.podDevices.containerDevices()
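For context, here is a generic sketch of the defensive lookup being suggested; the types below are simplified stand-ins for the devicemanager's nested podDevices mapping, and the PR ultimately reused the existing podDevices helper mentioned above instead.

// Generic sketch of a nil-safe nested-map lookup that logs and returns an
// empty set; simplified stand-in types, not the actual devicemanager code.
package main

import "fmt"

// deviceSet stands in for sets.String from k8s.io/apimachinery.
type deviceSet map[string]struct{}

// podDevices maps podUID -> containerName -> resourceName -> device IDs.
type podDevices map[string]map[string]map[string]deviceSet

// containerDevices returns the device IDs of one resource for one container,
// logging and returning an empty set when any level of the mapping is missing.
func (p podDevices) containerDevices(podUID, containerName, resource string) deviceSet {
	containers, ok := p[podUID]
	if !ok {
		fmt.Printf("no devices tracked for pod %q\n", podUID)
		return deviceSet{}
	}
	resources, ok := containers[containerName]
	if !ok {
		fmt.Printf("no devices tracked for container %q in pod %q\n", containerName, podUID)
		return deviceSet{}
	}
	devices, ok := resources[resource]
	if !ok {
		fmt.Printf("no %q devices allocated to container %q\n", resource, containerName)
		return deviceSet{}
	}
	return devices
}

func main() {
	p := podDevices{"pod-1": {"ctr-1": {"vendor.com/gpu": {"dev-0": {}, "dev-1": {}}}}}
	fmt.Println(len(p.containerDevices("pod-1", "ctr-1", "vendor.com/gpu"))) // 2
	fmt.Println(len(p.containerDevices("pod-2", "ctr-1", "vendor.com/gpu"))) // 0, with a log line
}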

@@ -28,26 +28,49 @@ import (
// ensures the Device Manager is consulted when Topology Aware Hints for each
// container are created.
func (m *ManagerImpl) GetTopologyHints(pod v1.Pod, container v1.Container) map[string][]topologymanager.TopologyHint {
	deviceHints := make(map[string][]topologymanager.TopologyHint)

	// Garbage collect any stranded device resources before providing TopologyHints
	m.updateAllocatedDevices(m.activePods())
A reviewer (Contributor) commented:
👍 for moving this out of getAvailableDevices.

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorDoyle, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Nov 6, 2019
@ConnorDoyle commented:

Responding to the specific questions called out by @klueska:

  1. I don't think there's a problem with calling activePods or GetPodStatus under the cpumanager mutex. The issue with holding the lock in removeStaleState may be that we do perform some file I/O within that function by virtue of calling m.policy.RemoveContainer. Ideally we could find a way to avoid that.
  2. Could you walk through the sequence of events that causes the pod status to be missing when GetTopologyHints is called on the cpumanager? As you pointed out in the comment, it seems like avoiding that possibility would have to wait until the cpumanager state is refactored to store (pod.Name, container.Name) composite keys instead of container ID. So even if it is possible, I'm not sure the problem is addressable in this patch.

/cc @derekwaynecarr for a 2nd opinion, he may have more background on pod status guarantees
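As background for item 1 above, here is a hypothetical sketch of the reclaim-before-allocate pattern being discussed; the manager type and its fields are simplified stand-ins, not the cpumanager's actual removeStaleState implementation.

// Hypothetical sketch of reclaiming state for containers that are no longer
// active before generating new hints or allocations.
package main

import (
	"fmt"
	"sync"
)

type manager struct {
	mu sync.Mutex
	// assignments maps containerID -> CPUs exclusively assigned to it.
	assignments map[string][]int
}

// removeStaleState drops assignments for containers that no longer belong to
// any active pod, returning their CPUs to the shared pool.
func (m *manager) removeStaleState(activeContainerIDs map[string]bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, cpus := range m.assignments {
		if !activeContainerIDs[id] {
			fmt.Printf("reclaiming CPUs %v from stale container %q\n", cpus, id)
			delete(m.assignments, id)
		}
	}
}

func main() {
	m := &manager{assignments: map[string][]int{"ctr-old": {2, 3}, "ctr-live": {4, 5}}}
	// Called before allocating to new containers and periodically from a
	// reconcile loop, as described in the thread above.
	m.removeStaleState(map[string]bool{"ctr-live": true})
	fmt.Println(m.assignments) // map[ctr-live:[4 5]]
}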

This patch also includes tests to make sure the newly added logic works as expected.
@klueska force-pushed the upstream-fix-hint-generation-after-kubelet-restart branch from 6b0bd6d to 4d4d4bd on November 6, 2019 15:30
@klueska commented Nov 6, 2019

@ConnorDoyle Thanks for the review, and thanks for the response to my questions!

I agree there's nothing more we can do in this PR if a pod status is not ready at the point we need it to be. It would be good to know whether it is at all possible not to have a valid status at this point, though. Let's wait and see what @derekwaynecarr's response is.

With that said, I think all that's missing is an /lgtm now that I've addressed: #84525 (review)

@ConnorDoyle commented:

/lgtm

@k8s-ci-robot added the lgtm label on Nov 6, 2019
@klueska commented Nov 6, 2019

/retest

@k8s-ci-robot merged commit 08e5781 into kubernetes:master on Nov 6, 2019
@k8s-ci-robot added this to the v1.17 milestone on Nov 6, 2019
Successfully merging this pull request may close these issues.

rebooting node (and maybe just restarting kubelet) results in running pod going "Failed"