
Fix exclusive CPU allocations being deleted at container restart #90377

Merged
1 commit merged into kubernetes:master on Apr 27, 2020

Conversation

@cbf123 (Contributor) commented Apr 22, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

The expectation is that exclusive CPU allocations happen at pod
creation time. When a container restarts, it should not have its
exclusive CPU allocations removed, and it should not need to
re-allocate CPUs.

There are a few places in the current code that look for containers that have exited and call CpuManager.RemoveContainer() to clean them up. This deletes any exclusive CPU allocations for that container, and if the container restarts within the same pod it ends up running in the default cpuset rather than on the CPUs that should be exclusively its own.

Removing those calls and adding resource cleanup at allocation time should get rid of the problem.
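To illustrate the failure mode and the direction of the fix, here is a simplified sketch. It uses stand-in types only, not the kubelet's actual InternalContainerLifecycle or CPUManager code:

// Sketch only: simplified stand-ins for the kubelet's container lifecycle
// hooks and CPUManager, illustrating the behavior change described above.
package main

import "fmt"

type cpuManager interface {
    RemoveContainer(containerID string) error
}

type lifecycleHooks struct {
    cpuManager cpuManager
}

// Old behavior: the post-stop hook eagerly freed the container's exclusive
// CPUs, even when the container was just restarting inside a still-running
// pod, so the restarted container fell back to the default cpuset.
func (i *lifecycleHooks) postStopContainerOld(containerID string) error {
    return i.cpuManager.RemoveContainer(containerID)
}

// New behavior: the post-stop hook leaves CPU assignments alone; stale
// assignments are reclaimed lazily at allocation/reconcile time once the
// owning pod is actually gone.
func (i *lifecycleHooks) postStopContainer(containerID string) error {
    return nil
}

type loggingCPUManager struct{}

func (m *loggingCPUManager) RemoveContainer(containerID string) error {
    fmt.Println("freeing exclusive CPUs for", containerID)
    return nil
}

func main() {
    hooks := &lifecycleHooks{cpuManager: &loggingCPUManager{}}
    _ = hooks.postStopContainerOld("ctr-1") // old: frees CPUs on every exit
    _ = hooks.postStopContainer("ctr-1")    // new: no-op; cleanup is deferred
}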

Which issue(s) this PR fixes:

Fixes #90303

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes regression in CPUManager that caused freeing of exclusive CPUs at incorrect times 

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 22, 2020
@k8s-ci-robot (Contributor)

Hi @cbf123. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cbf123 (Contributor, Author) commented Apr 22, 2020

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 22, 2020
@klueska (Contributor) commented Apr 22, 2020

/assign @klueska

@klueska (Contributor) commented Apr 22, 2020

We had a long discussion about this here for context:
https://kubernetes.slack.com/archives/C0BP8PW9G/p1587155932390500

@klueska (Contributor) left a comment

Minor changes requested. However, it would be nice to see a test or two added that trigger the bug before the change but pass after it. That way regressions like this won't happen in the future.
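Something along these lines, for instance. This is only a toy sketch of the shape of such a test, with a plain map standing in for the CPUManager's assignment state rather than the actual kubelet test fixtures:

package main

import "testing"

// Toy stand-in for exclusive CPU assignments: podUID -> container -> cpuset.
type toyAssignments map[string]map[string]string

// Buggy behavior being guarded against: freeing assignments whenever a
// container exits, even though its pod is still running.
func onContainerExitBuggy(a toyAssignments, podUID, container string) {
    delete(a[podUID], container)
}

// Fixed behavior: a container exit leaves assignments untouched; they are
// only freed once the owning pod itself is gone.
func onContainerExitFixed(a toyAssignments, podUID, container string) {}

func TestExclusiveCPUsSurviveContainerRestart(t *testing.T) {
    a := toyAssignments{"pod-a": {"ctr-1": "2-3"}}

    // Simulate a container restart within the same, still-running pod.
    onContainerExitFixed(a, "pod-a", "ctr-1")

    if got := a["pod-a"]["ctr-1"]; got != "2-3" {
        t.Fatalf("exclusive CPUs lost across restart: got %q, want %q", got, "2-3")
    }
}

func TestBuggyCleanupLosesExclusiveCPUs(t *testing.T) {
    a := toyAssignments{"pod-a": {"ctr-1": "2-3"}}

    // With the old behavior the same restart drops the assignment.
    onContainerExitBuggy(a, "pod-a", "ctr-1")

    if _, ok := a["pod-a"]["ctr-1"]; ok {
        t.Fatalf("expected the buggy cleanup to drop the assignment")
    }
}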

}
// We can't safely call i.cpuManager.RemoveContainer(containerID)
// here. Regular containers could be in the process of restarting, and
// RemoveContainer() would remove any allocated exclusive CPUs that the
Contributor

Can you fix the indenting here?

Contributor Author

Oops... my editor was set to use spaces instead of tabs.

}
// We can't safely call i.cpuManager.RemoveContainer(containerID)
// here. Regular containers could be in the process of restarting, and
// RemoveContainer() would remove any allocated exclusive CPUs that the
@klueska (Contributor) commented Apr 22, 2020

I wouldn't bother with this comment here. I know we used to have code here that called RemoveContainer(), but it's more confusing to see the comment out of context than to not see it at all -- especially since there is no path for calling RemoveContainer() from any external hooks anymore.

I actually plan to go through the TopologyManager after this and remove its callout as well -- thus removing the need for this hook (and the InternalContainerLifecycle interface) altogether.

@klueska (Contributor) commented Apr 22, 2020

As some more background, this is a regression due to the refactoring of the CPUManager that happened as part of:

https://github.com/kubernetes/kubernetes/pull/87759/commits

As part of that refactoring, the CPUManager moved to a model where CPUs are now allocated across all containers at pod admission time rather than as each individual container comes online. Since all CPUs are allocated at pod admission time, they can only properly be freed back to the shared pool at pod deletion time (or lazily after the pod is already gone).

In the old model, CPUs were allocated to each container as it came online (as part of a container pre-start-hook) so we were free to (and in fact required to) free them as each container exited (as part of a post-stop-hook). This is problematic in the new model, however, since CPUs are now assumed to retain their assignment to a container for the lifetime of a pod. As an oversight, the logic was left in place to do this freeing on each container exit instead of waiting for pod deletion. This causes problems (for example) when a container restarts without causing its bounding pod to be restarted.

This patch updates the CPUManager to make sure that CPUs are only ever freed back to the shared pool after a pod has been deleted. It does this by lazily calling the existing removeStaleState() function at appropriate times instead of directly calling RemoveContainer() at container exit. The removeStaleState() function itself walks through the CPUManager state and frees any CPUs not bound to actively running pods.

We now make this call at three locations in the code:

  1. At the top of the GetTopologyHints() call, just before a new pod runs its logic to generate hints for the TopologyManager. This ensures it will have access to any "newly" available CPUs from terminated pods when generating these hints.

  2. At the top of the Allocate() call, just before a new pod runs its logic to allocate CPUs. This ensures it will have access to any "newly" available CPUs from terminated pods when performing new allocations.

  3. Periodically, as part of the existing reconcileState() function. This guarantees that CPUs will be freed from terminated pods at least once per reconcile period (currently 10 seconds) in the case that no new pods enter the system and trigger the removeStaleState() function as part of Allocate().

In theory, we don't need (2) when the TopologyManager is enabled, because (1) and (2) are in the same synchronous loop. However, not all setups enable the TopologyManager, so it is required in both places for now.

In the future we should consider adding a hook for pod deletion instead of doing the lazy cleanup as part of (1) and (2). We will likely always do the lazy cleanup as part of (3), however, just to make sure we are always in a sane state.
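For illustration, the lazy-cleanup pattern described above boils down to something like the following sketch. This is not the actual CPUManager implementation; the state, policy, and pod-tracking details are heavily simplified stand-ins:

// Sketch of the lazy stale-state sweep described above. Types and helpers
// here (assignments map, activePods) are simplified stand-ins for the real
// CPUManager state and policy, not the actual kubelet code.
package main

import "fmt"

type manager struct {
    sourcesReadyAllReady func() bool
    activePods           func() []string            // active pod UIDs
    assignments          map[string]map[string]bool // podUID -> container -> has exclusive CPUs
}

func (m *manager) removeStaleState() {
    // Until all pod sources are ready we might not see every active pod,
    // so freeing anything now could strip CPUs from a pod we simply have
    // not observed yet.
    if !m.sourcesReadyAllReady() {
        return
    }
    active := map[string]bool{}
    for _, uid := range m.activePods() {
        active[uid] = true
    }
    // Free assignments belonging to pods that are no longer active.
    for podUID, containers := range m.assignments {
        if active[podUID] {
            continue
        }
        for name := range containers {
            fmt.Printf("freeing exclusive CPUs for %s/%s back to the shared pool\n", podUID, name)
            delete(containers, name)
        }
        delete(m.assignments, podUID)
    }
}

// The sweep is invoked lazily: before generating topology hints, before a new
// allocation, and periodically from the reconcile loop.
func (m *manager) GetTopologyHints() { m.removeStaleState() /* ...then compute hints... */ }
func (m *manager) Allocate()         { m.removeStaleState() /* ...then allocate CPUs... */ }
func (m *manager) reconcileState()   { m.removeStaleState() /* ...then reconcile containers... */ }

func main() {
    m := &manager{
        sourcesReadyAllReady: func() bool { return true },
        activePods:           func() []string { return []string{"pod-a"} },
        assignments: map[string]map[string]bool{
            "pod-a": {"ctr-1": true}, // still active: kept even if ctr-1 restarts
            "pod-b": {"ctr-2": true}, // pod deleted: freed on the next sweep
        },
    }
    m.Allocate()
}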

@klueska (Contributor) commented Apr 23, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 23, 2020
Comment on lines 43 to 48
type mockSourcesReady struct{}

func (s *mockSourcesReady) AddSource(source string) {}

func (s *mockSourcesReady) AllReady() bool { return false }

Contributor

There already exists a sourcesReadyStub you can use here instead of this.

Contributor

Removing this should fix your gofmt error in the pull-kubernetes-verify test as well.

Contributor Author

sourcesReadyStub.AllReady() returns true, which causes the new call to removeStaleState() to actually try to do real cleanup, which causes all sorts of grief in these tests.
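For context, the difference between the two helpers is just what AllReady() reports. A minimal sketch, assuming the existing stub mirrors the mock removed in the diff below except for that return value:

package main

import "fmt"

// Assumption for illustration: sourcesReadyStub looks like the mock above,
// except that AllReady() reports true (per the comment above), so
// removeStaleState() no longer bails out early in the tests.
type sourcesReadyStub struct{}

func (s *sourcesReadyStub) AddSource(source string) {}
func (s *sourcesReadyStub) AllReady() bool          { return true }

func main() {
    s := &sourcesReadyStub{}
    fmt.Println("sources ready:", s.AllReady()) // true -> the cleanup path actually runs
}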

Contributor

I see. I'd rather not artificially turn off valid code paths, though, since exercising them might uncover other underlying problems that are lurking.

Below is a diff to your current patch that deals with the issue you are seeing in a way that's more consistent with the rest of the tests in this file:

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager_test.go b/pkg/kubelet/cm/cpumanager/cpu_manager_test.go
index 7a6724d76c2..bd18b2c95eb 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager_test.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager_test.go
@@ -40,12 +40,6 @@ import (
 	"k8s.io/kubernetes/pkg/kubelet/cm/topologymanager"
 )

-type mockSourcesReady struct{}
-
-func (s *mockSourcesReady) AddSource(source string) {}
-
-func (s *mockSourcesReady) AllReady() bool { return false }
-
 type mockState struct {
 	assignments   state.ContainerCPUAssignments
 	defaultCPUSet cpuset.CPUSet
@@ -275,14 +269,14 @@ func TestCPUManagerAdd(t *testing.T) {
 				err: testCase.updateErr,
 			},
 			containerMap:      containermap.NewContainerMap(),
-			activePods:        func() []*v1.Pod { return nil },
 			podStatusProvider: mockPodStatusProvider{},
+			sourcesReady:      &sourcesReadyStub{},
 		}

-		mgr.sourcesReady = &mockSourcesReady{}
-
 		pod := makePod("fakePod", "fakeContainer", "2", "2")
 		container := &pod.Spec.Containers[0]
+		mgr.activePods = func() []*v1.Pod { return []*v1.Pod{pod} }
+
 		err := mgr.Allocate(pod, container)
 		if !reflect.DeepEqual(err, testCase.expAllocateErr) {
 			t.Errorf("CPU Manager Allocate() error (%v). expected error: %v but got: %v",
@@ -495,12 +489,13 @@ func TestCPUManagerAddWithInitContainers(t *testing.T) {
 			state:             state,
 			containerRuntime:  mockRuntimeService{},
 			containerMap:      containermap.NewContainerMap(),
-			activePods:        func() []*v1.Pod { return nil },
 			podStatusProvider: mockPodStatusProvider{},
+			sourcesReady:      &sourcesReadyStub{},
+			activePods: func() []*v1.Pod {
+				return []*v1.Pod{testCase.pod}
+			},
 		}

-		mgr.sourcesReady = &mockSourcesReady{}
-
 		containers := append(
 			testCase.pod.Spec.InitContainers,
 			testCase.pod.Spec.Containers...)
@@ -1031,14 +1026,14 @@ func TestCPUManagerAddWithResvList(t *testing.T) {
 				err: testCase.updateErr,
 			},
 			containerMap:      containermap.NewContainerMap(),
-			activePods:        func() []*v1.Pod { return nil },
 			podStatusProvider: mockPodStatusProvider{},
+			sourcesReady:      &sourcesReadyStub{},
 		}

-		mgr.sourcesReady = &mockSourcesReady{}
-
 		pod := makePod("fakePod", "fakeContainer", "2", "2")
 		container := &pod.Spec.Containers[0]
+		mgr.activePods = func() []*v1.Pod { return []*v1.Pod{pod} }
+
 		err := mgr.Allocate(pod, container)
 		if !reflect.DeepEqual(err, testCase.expAllocateErr) {
 			t.Errorf("CPU Manager Allocate() error (%v). expected error: %v but got: %v",

Contributor Author

Thanks for the test changes, applied.

@klueska (Contributor) commented Apr 23, 2020

Also, can you update the release note to:

Fixes regression in CPUManager that caused freeing of exclusive CPUs at incorrect times 

Even though this is not a user-facing change, we plan to backport this to the 1.18 branch, and having a release note in the original PR eases this process.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 23, 2020
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 23, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cbf123, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2020
@k8s-ci-robot (Contributor)

@cbf123: The following test failed, say /retest to rerun all failed tests:

Test name: pull-kubernetes-e2e-kind
Commit: ab5870d
Rerun command: /test pull-kubernetes-e2e-kind

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cbf123 (Contributor, Author) commented Apr 27, 2020

/test pull-kubernetes-e2e-kind

@k8s-ci-robot k8s-ci-robot merged commit 7fdc127 into kubernetes:master Apr 27, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Apr 27, 2020
@cbf123 cbf123 deleted the container_cpuset_fixup_2 branch April 27, 2020 20:50
k8s-ci-robot added a commit that referenced this pull request May 30, 2020
…7-upstream-release-1.18

Automated cherry pick of #90377: Fix exclusive CPU allocations being deleted at container
cynepco3hahue pushed a commit to cynepco3hahue/kubernetes that referenced this pull request Jun 2, 2020
Fix exclusive CPU allocations being deleted at container restart
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jun 10, 2020
…ainer restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jun 10, 2020
…ainer restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jun 10, 2020
…iner restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jun 11, 2020
…iner restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jul 12, 2020
…iner restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jul 20, 2020
…iner restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
cynepco3hahue pushed a commit to cynepco3hahue/origin that referenced this pull request Jul 23, 2020
…iner restart

ref: kubernetes/kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Aug 5, 2020
…iner restart

ref: kubernetes#90377

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>

Origin-commit: 3b9312345f11741b1ce1779bc644bf5441cae2c4
@hex108 (Contributor) commented Nov 18, 2020

Could we cherry pick it to release-1.17? Thanks! @klueska @cbf123

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Container cpuset lost, apparently due to race between PostStopContainer() and new container creation
5 participants