Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: updated logic of verifying a static critical pod #75144

Merged
merged 2 commits into from Mar 14, 2019

Conversation

@Huang-Wei
Copy link
Member

commented Mar 7, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

This is a supplemental fix of #74222.

The case in that #74222 completely replaces static pods with the mirror pods. But static pods contains important annotations such as kubernetes.io/config.source. For static pods is file, for mirror is api. As a result, eviction manager supposes that pod is not static.

This PR revers part of #74222 (e2e test is still valid and is kept), and updated the logic of verifying a static critical pod:

  • check if a pod is static by its static pod info
  • meanwhile, check if a pod is critical by its corresponding mirror pod info

Which issue(s) this PR fixes:

Supplemental fix for #73572

Special notes for your reviewer:

Here is a detailed explanation on why previous PR passed the e2e test:

  • As @namreg pointed out, the critical static pod failed on IsStaticPod() check (because of mirror pod is tagged as "kubernetes.io/config.source: api")
  • And it's reasonable that it's being marked as "to be deleted":
    I0307 18:09:26.138497   31769 eviction_manager.go:564] eviction manager: pod static-web-127.0.0.1_kube-system(513a7ec5-4136-11e9-b8f7-0cc47a8f8932) is evicted successfully
    
  • However, as @ravisantoshgudimetla observed, there is some weird log showing:
    I0307 18:09:27.138643   31769 kubelet_pods.go:918] Pod "static-web-127.0.0.1_kube-system(513a7ec5-4136-11e9-b8f7-0cc47a8f8932)" is terminated, but some containers are still running
    
  • It's because the pod we passed onto pkg/kubelet/pod_workers.go#killPodNow() is a mirror pod, and it's deleted (or probably a dummy delete) by container runtime without any error. To take an analogy, it's sort of like a "controllee" (mirror pod) gets deleted, but "controller" (static pod) is still alive, so it causes data inconsistency, and the above error msg popped up.
  • And in user's perspective, the static pod is "not" evicted, then it "looks" like previous PR works.

Does this PR introduce a user-facing change?:

kubelet: updated logic of verifying a static critical pod.

/sig node
/sig scheduling

/cc @ravisantoshgudimetla @derekwaynecarr

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 7, 2019

/retest

2 similar comments
@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

/retest

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

/retest

@namreg

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@namreg

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

Just curious, maybe we should tag mirror pods as “kubernetes.io/config.source: file”?

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Probably not, unless we have a strong reason.

if mirrorPod, ok := m.mirrorPodFunc(pod); ok && mirrorPod != nil {
isCriticalPod = kubelettypes.IsCriticalPod(mirrorPod)
} else {
isCriticalPod = kubelettypes.IsCriticalPod(pod)

This comment has been minimized.

Copy link
@dashpole

dashpole Mar 12, 2019

Contributor

do we still need this case? The description made it sound like we only needed to check the mirror pod.

This comment has been minimized.

Copy link
@Huang-Wei

Huang-Wei Mar 12, 2019

Author Member

@dashpole Theoretically yes. The else block here is just to catch some exceptional case such like a static pod doesn't have a corresponding mirror pod in memory.

This comment has been minimized.

Copy link
@dashpole

dashpole Mar 12, 2019

Contributor

IIUC, static pods will never be critical? Or is that possible if they are using the critical pod annotation, rather than priority?

This comment has been minimized.

Copy link
@Huang-Wei

Huang-Wei Mar 12, 2019

Author Member

In most cases, static pods are critical.

Critical pod annotation is another story, it's now deprecated in favor of PriorityClass. You can think of it as another style of "system-node-critical" priority.

This comment has been minimized.

Copy link
@dashpole

dashpole Mar 12, 2019

Contributor

Right, i'm asking if the reason why we care about the "exceptional case" is because some static pods could be critical (even if we don't have a mirror pod).

If we don't have a mirror pod in memory, the right thing to do might just be to skip it (don't evict it). If we mistakenly evict a critical static pod, that can have drastic consequences for nodes.

This comment has been minimized.

Copy link
@Huang-Wei

Huang-Wei Mar 12, 2019

Author Member

I get your point. I will go logging an error instead of proceeding on this case, and when we hit this case, it means we're hitting an in-memory data inconsistency. Sounds good?

This comment has been minimized.

Copy link
@dashpole

dashpole Mar 12, 2019

Contributor

sounds good!

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:followup-74222 branch 2 times, most recently from d69675c to a31a72b Mar 12, 2019

@ravisantoshgudimetla

This comment has been minimized.

Copy link
Contributor

commented Mar 13, 2019

And in user's perspective, the static pod is "not" evicted, then it "looks" like previous PR works.

@Huang-Wei - To be clear, my understanding is eventually the mirror pod would be generated again since static pod is not evicted. This will make any static pod to not get evicted which is a side-effect of change introduced in previous PR and now we are ensuring that we are actually evicting non-critical static pods?

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 13, 2019

now we are ensuring that we are actually evicting mirror pods?

Typo? Now we are evicting static pods (if it's a non-critical static pod).

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:followup-74222 branch from a31a72b to 53813bc Mar 13, 2019

@ravisantoshgudimetla

This comment has been minimized.

Copy link
Contributor

commented Mar 13, 2019

Typo? Now we are evicting static pods (if it's a non-critical static pod).

Right, non-critical static pod, updated my comment but is my understanding correct?

if mirrorPod, ok := m.mirrorPodFunc(pod); ok && mirrorPod != nil {
// skip only when it's a static and critical pod
if kubelettypes.IsCriticalPod(mirrorPod) {
klog.Errorf("eviction manager: cannot evict a critical static pod %s", format.Pod(pod))

This comment has been minimized.

Copy link
@ravisantoshgudimetla

ravisantoshgudimetla Mar 13, 2019

Contributor

Can you add some unit tests for this?

This comment has been minimized.

Copy link
@Huang-Wei

Huang-Wei Mar 13, 2019

Author Member

TestCriticalPodsAreNotEvicted() already covers this. To make it still working, I added a simple mirrorPodFunc() implementation and pass to managerImpl.

}
} else {
// we should never hit this
klog.Errorf("eviction manager: cannot get mirror pod from static pod %s", format.Pod(pod))

This comment has been minimized.

Copy link
@ravisantoshgudimetla

ravisantoshgudimetla Mar 13, 2019

Contributor

and unit tests for this too?

kubelet: updated logic of verifying a static critical pod
- check if a pod is static by its static pod info
- meanwhile, check if a pod is critical by its corresponding mirror pod info

@Huang-Wei Huang-Wei force-pushed the Huang-Wei:followup-74222 branch from 53813bc to d67e7fd Mar 13, 2019

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 13, 2019

Right, non-critical static pod, updated my comment but is my understanding correct?

Yes.

@dims

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

@dims

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

/milestone v1.14

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 14, 2019

@dims

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

/priority important-soon

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

thank you very much for the detailed description, it helped a lot when reviewing here.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 14, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 14, 2019

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 14, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit b3ec6c1 into kubernetes:master Mar 14, 2019

17 checks passed

cla/linuxfoundation Huang-Wei authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

@Huang-Wei Huang-Wei deleted the Huang-Wei:followup-74222 branch Mar 14, 2019

@RobertKrawitz

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

Is this planned for 1.13/1.12 backport?

@Huang-Wei

This comment has been minimized.

Copy link
Member Author

commented Mar 14, 2019

Yes, I will do it shortly.

@@ -57,6 +57,8 @@ type managerImpl struct {
config Config
// the function to invoke to kill a pod
killPodFunc KillPodFunc
// the function to get the mirror pod by a given statid pod

This comment has been minimized.

Copy link
@tedyu

tedyu Mar 15, 2019

Contributor

typo: statid

return false
}
} else {
// we should never hit this

This comment has been minimized.

Copy link
@tedyu

tedyu Mar 15, 2019

Contributor

panic (since we won't hit this) ?

k8s-ci-robot added a commit that referenced this pull request Mar 21, 2019

Merge pull request #74994 from Huang-Wei/automated-cherry-pick-of-#74…
…222-upstream-release-1.13

[1.13] Automated cherry pick of #75144: kubelet: updated logic of verifying a static critical pod

k8s-ci-robot added a commit that referenced this pull request Mar 22, 2019

Merge pull request #74995 from Huang-Wei/automated-cherry-pick-of-#74…
…222-upstream-release-1.12

[1.12] Automated cherry pick of #75144: kubelet: updated logic of verifying a static critical pod

k8s-ci-robot added a commit that referenced this pull request Apr 1, 2019

Merge pull request #74996 from Huang-Wei/automated-cherry-pick-of-#74…
…222-upstream-release-1.11

[1.11] Automated cherry pick of #75144: kubelet: updated logic of verifying a static critical pod
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.