
Update logic in CPUManager reconcileState() #84300

Conversation

@klueska (Contributor) commented Oct 24, 2019

What type of PR is this?

/kind cleanup
What this PR does / why we need it:
This PR cleans up the logic for reconcileState() in the CPUManager.

Some background:

The CPUManager maintains state about the CPUSet that should be associated with any given container on a node. For containers that require a dedicated set of CPUs, the CPUManager tracks a mapping of (containerID)->(CPUSet), where this CPUSet contains the list of dedicated CPUs that the CPUManager has granted to the container. For containers that do not require a dedicated set of CPUs, no explicit state is maintained by the CPUManager about these containers. Instead, a DefaultCPUSet is maintained, which contains the set of all CPUs that are not part of any (containerID)->(CPUSet) mapping.
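
The shape of this state looks roughly like the following (a simplified sketch; the type and field names are approximate, not the exact kubelet types, which live behind a state.State interface):

```go
package cpumanager

import "k8s.io/kubernetes/pkg/kubelet/cm/cpuset"

// cpuManagerState is a simplified sketch of the state described above.
type cpuManagerState struct {
	// assignments maps containerID -> CPUSet for containers that were
	// granted dedicated CPUs. Containers without dedicated CPUs have
	// no entry here.
	assignments map[string]cpuset.CPUSet

	// defaultCPUSet holds every CPU not claimed by any entry in
	// assignments; containers without dedicated CPUs run on this set.
	defaultCPUSet cpuset.CPUSet
}
```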

Once this state has been established, the actual CPUSet of a container can be updated via an Update(CPUSet) call on the specific ContainerRuntime in use. As each container is added to the system, a single ContainerRuntime.Update(CPUSet) call is made with the appropriate CPUSet for that container.
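
In the kubelet, such an update ultimately goes through CRI's UpdateContainerResources call. A minimal sketch of that plumbing (the helper and the containerRuntime field are hypothetical names; the import path is approximate for this era of the codebase):

```go
import (
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

// updateContainerCPUSet pushes a cpuset to a container through the CRI.
// A sketch only; error handling and retries omitted.
func (m *manager) updateContainerCPUSet(containerID string, cpus cpuset.CPUSet) error {
	return m.containerRuntime.UpdateContainerResources(
		containerID,
		&runtimeapi.LinuxContainerResources{
			// CRI takes the cpuset as a string, e.g. "0-3,8".
			CpusetCpus: cpus.String(),
		})
}
```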

As you can imagine, however, as containers come and go in the system, the DefaultCPUSet can change quite frequently. Whenever it does, the containers at the mercy of the DefaultCPUSet need to have their CPUSets updated via a new call to ContainerRuntime.Update(DefaultCPUSet). At present, however, there is no easy path to synchronously update these containers whenever the DefaultCPUSet changes.

Instead, a function called reconcileState() is run in an asynchronous loop every reconcilePeriod seconds to accomplish this task. It looks at the set of all active containers on the node and calls ContainerRuntime.Update(CPUSet) with the appropriate CPUSet for each. While this doesn't guarantee that all containers have the correct CPUSet associated with them at all times, it does guarantee that all containers converge to the correct value every reconcilePeriod seconds.
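
The loop mechanism itself is simple; something along these lines (a sketch of the mechanism, not the exact kubelet code — the function and channel names here are illustrative):

```go
import "time"

// reconcileLoop re-runs reconcileState() every reconcilePeriod until
// stopCh is closed.
func (m *manager) reconcileLoop(stopCh <-chan struct{}) {
	ticker := time.NewTicker(m.reconcilePeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Converge every active container back to the CPUSet
			// the CPUManager currently believes it should have.
			m.reconcileState()
		case <-stopCh:
			return
		}
	}
}
```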

Unfortunately, the logic inside reconcileState() has become convoluted over time. While the basic idea behind reconcileState() is fairly straightforward, edge cases were found that caused the basic flow to diverge from its original intended purpose.

For example, there is currently a path inside reconcileState() that makes a call out to AddContainer() if an active container is found that has no CPUSet associated with it. Presumably, this was added to cover the case where reconcileState() began to execute asynchronously to the container in question actually being created. Since AddContainer() was designed to be idempotent, whoever got to the call first (either reconcileState() or the container creation path itself) would do the AddContainer() and everything could continue forward as expected.

As we know, however, any containers that don't have dedicated CPUs also don't have a CPUSet associated with them. This means that AddContainer() is erroneously being called on every container that has no dedicated CPUs. This is harmless because of the idempotency of the AddContainer() call, but it convolutes the logic significantly.

One of the reasons this AddContainer() call is necessary inside reconcileState() is that there is currently no way to "skip" a container that is not yet ready for further processing. Whether a container should be known to the CPUManager is gleaned entirely from the existence of the container in the PodStatus (regardless of that container's specific state).

This PR attempts to clean this up and make the logic inside reconcileState() a bit more sane. It does this by combining the containerMap introduced in #84196 with a move to a model that looks at the specific state of a given container inside the PodStatus, rather than just checking for the container's existence in the PodStatus.

The containerMap lets us know for sure whether a container has completed an AddContainer() call and should have its state reconciled. If it has not, we simply skip it.

Using the state of the container lets us decide what to do depending on whether it is currently waiting, running, or terminated (see the sketch below). When waiting, we skip it with a warning. When terminated, we skip it without a warning and remove it so that it is never attempted again in the future. Only when running do we continue on to attempt a reconciliation.
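
Put together, the per-container decision looks roughly like this (a sketch only; the containerMap method names are approximate, and details such as stripping the runtime prefix from the container ID are omitted):

```go
for _, cstatus := range pod.Status.ContainerStatuses {
	containerID := cstatus.ContainerID

	// Skip containers that never completed an AddContainer() call;
	// the containerMap from #84196 tells us whether they did.
	if _, _, err := m.containerMap.GetContainerRef(containerID); err != nil {
		continue
	}

	switch {
	case cstatus.State.Running != nil:
		// Running: fall through and reconcile its CPUSet below.
	case cstatus.State.Terminated != nil:
		// Terminated: remove it so it is never attempted again.
		m.containerMap.RemoveByContainerID(containerID)
		continue
	default:
		// Waiting (an all-nil ContainerState also means waiting):
		// skip with a warning and retry on the next pass.
		klog.Warningf("[cpumanager] reconcileState: skipping container; still waiting (container id: %s)", containerID)
		continue
	}

	// ...reconcile the running container's CPUSet here...
}
```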

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 24, 2019
@klueska klueska changed the title Upstream cpu manager reconcile on container state Update logic in CPUManager reconcileState() Oct 24, 2019
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2019
@klueska klueska force-pushed the upstream-cpu-manager-reconcile-on-container-state branch from 89cc0ec to 158ba82 Compare October 24, 2019 16:18
@liggitt liggitt removed their request for review October 24, 2019 16:28
@klueska (Contributor, Author) commented Oct 24, 2019

/assign @ConnorDoyle

@derekwaynecarr (Member) commented:

@klueska I assume we want to settle on #84196 first?

@klueska klueska changed the title Update logic in CPUManager reconcileState() [WIP] Update logic in CPUManager reconcileState() Dec 4, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2019
@klueska (Contributor, Author) commented Dec 4, 2019

/hold

Will revisit once #84462 is merged.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2019
@klueska klueska closed this Dec 4, 2019
@klueska klueska reopened this Dec 4, 2019
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2019
@klueska klueska force-pushed the upstream-cpu-manager-reconcile-on-container-state branch from 158ba82 to 94a0cb6 Compare December 4, 2019 14:54
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 4, 2019
@klueska klueska force-pushed the upstream-cpu-manager-reconcile-on-container-state branch from 94a0cb6 to 0c78497 Compare January 20, 2020 14:04
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 20, 2020
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2020
@klueska klueska force-pushed the upstream-cpu-manager-reconcile-on-container-state branch from 0c78497 to 7be9b0f Compare January 20, 2020 14:31
@klueska klueska changed the title [WIP] Update logic in CPUManager reconcileState() Update logic in CPUManager reconcileState() Jan 20, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2020
Review thread on the diff in reconcileState():

```go
// (old code being removed by this PR)
klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
err := m.AddContainer(pod, &container, containerID)

// (new condition added by this PR; the review question below refers to it)
if cstatus.State.Waiting != nil ||
	(cstatus.State.Waiting == nil && cstatus.State.Running == nil && cstatus.State.Terminated == nil) {
```


Just curious here and making sure it's not an oversight... if all three states (waiting, running, and terminated) are nil, does that also mean the correct state is in fact 'waiting'?

@klueska (Contributor, Author) replied Jan 20, 2020

Correct. From the comment underneath https://godoc.org/k8s.io/api/core/v1#ContainerState:

ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
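
For reference, the type in question from k8s.io/api/core/v1 (JSON tags and field comments trimmed):

```go
// ContainerState holds a possible state of container. Only one of its
// members may be specified; if none is, the default is waiting.
type ContainerState struct {
	Waiting    *ContainerStateWaiting
	Running    *ContainerStateRunning
	Terminated *ContainerStateTerminated
}
```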

@nolancon commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2020
@klueska (Contributor, Author) commented Jan 20, 2020

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 20, 2020
@klueska (Contributor, Author) commented Jan 20, 2020

/test pull-kubernetes-node-kubelet-serial-cpu-manager

@klueska (Contributor, Author) commented Jan 20, 2020

/retest

@k8s-ci-robot k8s-ci-robot merged commit e6b5194 into kubernetes:master Jan 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 20, 2020
@vpickard (Contributor) commented:

/test pull-kubernetes-node-kubelet-serial-cpu-manager

@haosdent (Member) commented:

@klueska After this PR, if a container has been assigned a dedicated cpuset and it restarts inside the pod, it ends up on the default cpuset after the restart instead of the dedicated cpuset it had before.

@klueska (Contributor, Author) commented Apr 25, 2020

Yes. That was an unfortunate oversight. Please see the following for the (long) discussion and PR to fix it:

https://kubernetes.slack.com/archives/C0BP8PW9G/p1587155932390500

#90377

@haosdent (Member) commented:

Thanks @klueska, let me check the PR you mentioned.
