Remove terminated pod from summary api. #77426

Merged
merged 1 commit into from May 7, 2019

Conversation

@Random-Liu
Member

commented May 4, 2019

I found this when debugging a containerd test failure introduced after #77101 (comment).

In some cases, we may have pod stats with the same name and namespace today:

  1. The pod sandbox was restarted once, and old pod sandboxes and containers are still kept around;
  2. Pods with the same name and namespace are deleted and recreated in a short period of time.

This happens only with the CRI integration, because CRI returns stats for terminated containers. To keep the behavior of the summary API consistent with what it was before, we should filter out those terminated pods.

Otherwise, this may break metrics that currently use only the labels {container name, pod name, pod namespace}, e.g. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/stats/prometheus_resource_metrics.go#L47.

Errors like this will be returned when querying metrics:

an error on the server ("An error has occurred while serving metrics:\n\ncollected metric \"kubelet_container_log_filesystem_used_bytes\" { label:<name:\"container\" value:\"test-container-subpath-local-preprovisionedpv-2ws6\" > label:<name:\"namespace\" value:\"provisioning-85\" > label:<name:\"pod\" value:\"pod-subpath-test-local-preprovisionedpv-2ws6\" > gauge:<value:8192 > } was collected before with the same name and label values") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-t5wj:10250)
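
To illustrate why duplicate pod stats trigger this error: a collector that identifies each sample only by (container, pod, namespace) sees two samples with identical label values when two instances of the same pod are reported. The sketch below imitates that uniqueness check with a plain map. It does not use the Prometheus client library, and `labelKey`, `sample`, and `checkUnique` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// labelKey is a hypothetical label set: the only labels the metric carries.
type labelKey struct {
	Container string
	Pod       string
	Namespace string
}

// sample is a hypothetical metric sample identified solely by its labels.
type sample struct {
	Labels labelKey
	Value  float64
}

var errDuplicate = errors.New("metric was collected before with the same name and label values")

// checkUnique returns an error as soon as two samples share the same label set.
func checkUnique(samples []sample) error {
	seen := make(map[labelKey]bool)
	for _, s := range samples {
		if seen[s.Labels] {
			return fmt.Errorf("%w: %+v", errDuplicate, s.Labels)
		}
		seen[s.Labels] = true
	}
	return nil
}

func main() {
	labels := labelKey{Container: "c", Pod: "p", Namespace: "ns"}
	samples := []sample{
		{Labels: labels, Value: 8192}, // running instance of the pod
		{Labels: labels, Value: 4096}, // terminated instance with the same name and namespace
	}
	// Both instances collide on the label set, so an error is reported.
	fmt.Println(checkUnique(samples))
	// With only the running instance, the check passes.
	fmt.Println(checkUnique(samples[:1]))
}
```

Once the terminated instance is filtered out before collection, each label set appears only once and the check passes.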

There are 2 metrics affected by this:

/cc @dashpole @davidz627

Signed-off-by: Lantao Liu <lantaol@google.com>

What type of PR is this?
/kind bug
/kind failing-test

If a pod has a running instance, the stats of its previously terminated instances will no longer show up in the kubelet summary stats for CRI runtimes like containerd and cri-o.

This keeps the behavior consistent with the Docker integration, and fixes an issue where some container Prometheus metrics don't work when there are summary stats for multiple instances of the same pod.
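
The filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the `sandbox` and `podKey` types and the `removeTerminated` function are hypothetical, simplified stand-ins for the CRI pod sandbox types, and the input is assumed to be ordered so that the last entry for a pod is the one worth keeping when none is ready.

```go
package main

import "fmt"

// sandbox is a hypothetical, simplified stand-in for a CRI pod sandbox.
type sandbox struct {
	ID        string
	Name      string
	Namespace string
	Ready     bool
}

// podKey identifies a pod by name and namespace only.
type podKey struct {
	Name      string
	Namespace string
}

// removeTerminated keeps only the ready sandboxes for each (name, namespace)
// pair. If a pod has no ready sandbox, the last one seen is kept so that the
// pod still appears once in the result.
func removeTerminated(sandboxes []sandbox) []sandbox {
	groups := make(map[podKey][]sandbox)
	var order []podKey // first-seen order, for deterministic output
	for _, s := range sandboxes {
		k := podKey{Name: s.Name, Namespace: s.Namespace}
		if _, ok := groups[k]; !ok {
			order = append(order, k)
		}
		groups[k] = append(groups[k], s)
	}

	var result []sandbox
	for _, k := range order {
		refs := groups[k]
		found := false
		for _, s := range refs {
			if s.Ready {
				found = true
				result = append(result, s)
			}
		}
		if !found {
			// No running instance: fall back to the last sandbox in the group.
			result = append(result, refs[len(refs)-1])
		}
	}
	return result
}

func main() {
	in := []sandbox{
		{ID: "a1", Name: "web", Namespace: "default", Ready: false}, // old, terminated
		{ID: "a2", Name: "web", Namespace: "default", Ready: true},  // restarted, running
		{ID: "b1", Name: "job", Namespace: "batch", Ready: false},   // only instance, terminated
	}
	for _, s := range removeTerminated(in) {
		fmt.Println(s.ID, s.Namespace+"/"+s.Name)
	}
}
```

In this sketch the terminated `a1` is dropped because `a2` is running, while `b1` is kept because it is the only instance of its pod.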
@k8s-ci-robot

Contributor

commented May 4, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Random-Liu

Remove terminated pod from summary api.
Signed-off-by: Lantao Liu <lantaol@google.com>

@Random-Liu Random-Liu force-pushed the Random-Liu:remove-terminated-pod branch from 0f9e7c7 to 11cd424 May 4, 2019

@feiskyer
Member

left a comment

lgtm

/retest

@yujuhong
Member

left a comment

Need a release note.

pkg/kubelet/stats/cri_stats_provider.go
@yujuhong

Member

commented May 7, 2019

Need a release note.
^^^^ still missing :-)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 7, 2019

@Random-Liu

Member Author

commented May 7, 2019

Need a release note.

Done.

@dashpole

Contributor

commented May 7, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 946087b into kubernetes:master May 7, 2019

20 checks passed

cla/linuxfoundation: Random-Liu authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.

alejandrox1 added a commit to alejandrox1/kubernetes that referenced this pull request May 7, 2019
Merge pull request kubernetes#77426 from Random-Liu/remove-terminated-pod

Remove terminated pod from summary api.
			}
		}
		if !found {
			result = append(result, refs[len(refs)-1])

@tedyu
Contributor

commented May 8, 2019

I wonder why this is needed?
I assume the func would only keep READY sandboxes.

When I comment out this line, cri_stats_provider_test still passes.

k8s-ci-robot added a commit that referenced this pull request May 10, 2019
Merge pull request #77640 from Random-Liu/automated-cherry-pick-of-#77426-upstream-release-1.14

Automated cherry pick of #77426: Remove terminated pod from summary api.
k8s-ci-robot added a commit that referenced this pull request May 21, 2019
Merge pull request #77639 from Random-Liu/automated-cherry-pick-of-#77426-upstream-release-1.13

Automated cherry pick of #77426: Remove terminated pod from summary api.