Remove terminated pod from summary api. #77426

Merged
merged 1 commit into from
May 7, 2019

@Random-Liu Random-Liu commented May 4, 2019

I found this when debugging a containerd test failure introduced after #77101 (comment).

In some cases, we may have pod stats with the same name and namespace today:

  1. The pod sandbox was restarted once, and old pod sandboxes and containers are still kept around;
  2. Pods with the same name and namespace are deleted and recreated in a short period of time.

This only happens with the CRI integration, because we return stats for terminated containers from CRI. To keep the behavior of the summary API consistent with what it was before, we should filter out those terminated pods.
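As a rough illustration of the filtering described above, here is a minimal sketch. It is not the code in this PR: the `sandbox`, `podKey`, and `filterTerminated` names are hypothetical stand-ins for the real CRI types, and it assumes each pod's sandboxes arrive ordered oldest to newest.

```go
package main

import "fmt"

// sandboxState is a hypothetical stand-in for the runtime's sandbox state.
type sandboxState int

const (
	sandboxReady sandboxState = iota
	sandboxNotReady
)

// sandbox is a simplified, hypothetical view of a pod sandbox.
type sandbox struct {
	ID        string
	Name      string
	Namespace string
	State     sandboxState
}

// podKey identifies a pod by name and namespace only, like the affected metrics.
type podKey struct {
	Name, Namespace string
}

// filterTerminated groups sandboxes by pod name and namespace. For a pod with
// several sandboxes it keeps only the ready ones; if none is ready it keeps
// the last one in the group (assumed here to be the newest).
func filterTerminated(sandboxes []sandbox) []sandbox {
	groups := make(map[podKey][]sandbox)
	var order []podKey // first-seen order, so the output is deterministic
	for _, s := range sandboxes {
		k := podKey{s.Name, s.Namespace}
		if _, ok := groups[k]; !ok {
			order = append(order, k)
		}
		groups[k] = append(groups[k], s)
	}

	var result []sandbox
	for _, k := range order {
		refs := groups[k]
		if len(refs) == 1 {
			// A single instance is always reported, whatever its state.
			result = append(result, refs[0])
			continue
		}
		found := false
		for _, s := range refs {
			if s.State == sandboxReady {
				found = true
				result = append(result, s)
			}
		}
		if !found {
			result = append(result, refs[len(refs)-1])
		}
	}
	return result
}

func main() {
	in := []sandbox{
		{ID: "a1", Name: "web", Namespace: "default", State: sandboxNotReady},
		{ID: "a2", Name: "web", Namespace: "default", State: sandboxReady},
		{ID: "b1", Name: "job", Namespace: "batch", State: sandboxNotReady},
	}
	for _, s := range filterTerminated(in) {
		fmt.Println(s.ID) // prints a2, then b1
	}
}
```

In this sketch the restarted pod `web` reports only its ready sandbox, while `job`, which has a single terminated sandbox, is still reported.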

Otherwise, this may break metrics that only use the labels {container name, pod name, pod namespace} today, e.g. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/stats/prometheus_resource_metrics.go#L47.

Errors like this will be returned when querying metrics:

an error on the server ("An error has occurred while serving metrics:\n\ncollected metric \"kubelet_container_log_filesystem_used_bytes\" { label:<name:\"container\" value:\"test-container-subpath-local-preprovisionedpv-2ws6\" > label:<name:\"namespace\" value:\"provisioning-85\" > label:<name:\"pod\" value:\"pod-subpath-test-local-preprovisionedpv-2ws6\" > gauge:<value:8192 > } was collected before with the same name and label values") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-t5wj:10250)
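To show why two pod instances with the same name and namespace trigger this error, here is a minimal sketch of a duplicate check keyed on metric name plus label values. It is a toy model, not the Prometheus client library: `sample`, `identity`, and `gather` are hypothetical names, and the label values and the second sample's value are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical gauge sample: a metric name, labels, and a value.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// identity builds a key from the metric name and the sorted label pairs.
// Two samples with the same identity are indistinguishable to a scraper.
func identity(s sample) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(s.Name)
	for _, k := range keys {
		fmt.Fprintf(&b, "|%s=%q", k, s.Labels[k])
	}
	return b.String()
}

// gather returns an error for the first sample whose identity was already seen.
func gather(samples []sample) error {
	seen := make(map[string]bool)
	for _, s := range samples {
		id := identity(s)
		if seen[id] {
			return fmt.Errorf("metric %q was collected before with the same name and label values", s.Name)
		}
		seen[id] = true
	}
	return nil
}

func main() {
	// Two instances of the same pod produce identical label sets, because
	// the labels carry no pod UID or sandbox ID.
	labels := map[string]string{"container": "test-container", "namespace": "provisioning-85", "pod": "pod-subpath-test"}
	samples := []sample{
		{Name: "kubelet_container_log_filesystem_used_bytes", Labels: labels, Value: 8192},
		{Name: "kubelet_container_log_filesystem_used_bytes", Labels: labels, Value: 4096},
	}
	if err := gather(samples); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because the labels cannot tell the two instances apart, the second sample collides with the first, which is why filtering terminated pods out of the summary stats avoids the error.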

There are 2 metrics affected by this:

/cc @dashpole @davidz627

Signed-off-by: Lantao Liu <lantaol@google.com>

What type of PR is this?
/kind bug
/kind failing-test

If a pod has a running instance, the stats of its previously terminated instances will not show up in the kubelet summary stats any more for CRI runtimes like containerd and cri-o.

This keeps the behavior consistent with the Docker integration, and fixes an issue where some container Prometheus metrics fail when there are summary stats for multiple instances of the same pod.

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 4, 2019
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 4, 2019
@Random-Liu Random-Liu added this to the v1.14 milestone May 4, 2019
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Random-Liu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2019
@Random-Liu Random-Liu added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 4, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 4, 2019
@Random-Liu Random-Liu added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 4, 2019
@k8s-ci-robot k8s-ci-robot added area/kubelet and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 4, 2019
@Random-Liu Random-Liu modified the milestones: v1.14, v1.13 May 4, 2019
Signed-off-by: Lantao Liu <lantaol@google.com>
@feiskyer feiskyer left a comment

lgtm

/retest

@yujuhong yujuhong left a comment

Need a release note.

pkg/kubelet/stats/cri_stats_provider.go
@yujuhong

yujuhong commented May 7, 2019

> Need a release note.

^^^^ still missing :-)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 7, 2019
@Random-Liu

> Need a release note.

Done.

@dashpole

dashpole commented May 7, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 946087b into kubernetes:master May 7, 2019
		}
	}
	if !found {
		result = append(result, refs[len(refs)-1])

I wonder why this is needed?
I assume the func would only keep READY sandboxes.

When I comment out this line, cri_stats_provider_test still passes.
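One possible reading of the fallback, which this thread does not confirm: it covers a pod whose instances are all terminated. A minimal sketch with a hypothetical `selectSandboxes` helper and a `fallback` switch (neither is in the PR) shows the difference:

```go
package main

import "fmt"

// sandbox is a hypothetical, simplified sandbox of a single pod.
type sandbox struct {
	ID    string
	Ready bool
}

// selectSandboxes picks which sandboxes of one pod to report. With fallback
// enabled it keeps the last sandbox when none is ready, mirroring the
// `if !found` branch quoted above; with it disabled it keeps nothing.
func selectSandboxes(refs []sandbox, fallback bool) []sandbox {
	var result []sandbox
	found := false
	for _, s := range refs {
		if s.Ready {
			found = true
			result = append(result, s)
		}
	}
	if !found && fallback && len(refs) > 0 {
		result = append(result, refs[len(refs)-1])
	}
	return result
}

func main() {
	// Assumed scenario: a pod whose sandboxes have all terminated.
	terminated := []sandbox{{ID: "old", Ready: false}, {ID: "newer", Ready: false}}
	fmt.Println(len(selectSandboxes(terminated, true)))  // prints 1
	fmt.Println(len(selectSandboxes(terminated, false))) // prints 0
}
```

Under that reading, dropping the line would make such a pod disappear from the summary stats entirely. The unit test still passing would then be consistent with no test case covering a pod that has several sandboxes and none ready.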

k8s-ci-robot added a commit that referenced this pull request May 10, 2019
…7426-upstream-release-1.14

Automated cherry pick of #77426: Remove terminated pod from summary api.
k8s-ci-robot added a commit that referenced this pull request May 21, 2019
…7426-upstream-release-1.13

Automated cherry pick of #77426: Remove terminated pod from summary api.