
no processing for successfully exited pods #765

Merged 1 commit into koordinator-sh:main on Nov 1, 2022

Conversation

lucming (Contributor) commented Oct 31, 2022

Ⅰ. Describe what this PR does

The pod has exited successfully, but koordlet still keeps trying to modify the pod and container cgroups, which is unnecessary.

[Screenshot 2022-10-31 17:46:55]

koordlet keeps writing the cgroup files even though the pod has exited successfully. However, once the pod exits, its associated cgroup files are deleted, so the following error log is reported repeatedly:

I1031 17:42:38.810404 3754286 resource_update_executor.go:137] manager: CPUBurstExecutor, currentResource: &{0xc000ecfcc0 0 kubepods/besteffort/podd7f19c1d-c555-476f-a237-ad4bf3154e2a/989ab9f8388c97628682d96662ab0573c319ddcb2743d7e49765ed8d6baa79ce {cpu.cfs_burst_us cpu/ true 0x3965fa0} {0 0 <nil>} 0x1cbe080 <nil> false}, preResource: <nil>, need update

I1031 17:42:38.810432 3754286 cpu_burst.go:518] update container kube-system/node-shell-44e1d62f-cc8e-468e-a966-e453e58afc02/shell cpu burst failed, dir kubepods/besteffort/podd7f19c1d-c555-476f-a237-ad4bf3154e2a/989ab9f8388c97628682d96662ab0573c319ddcb2743d7e49765ed8d6baa79ce, updated true, error open /host-cgroup/cpu/kubepods/besteffort/podd7f19c1d-c555-476f-a237-ad4bf3154e2a/989ab9f8388c97628682d96662ab0573c319ddcb2743d7e49765ed8d6baa79ce/cpu.cfs_burst_us: no such file or directory

I1031 17:42:38.810445 3754286 resource_update_executor.go:137] manager: CPUBurstExecutor, currentResource: &{0xc000ecfd40 0 kubepods/besteffort/podd7f19c1d-c555-476f-a237-ad4bf3154e2a {cpu.cfs_burst_us cpu/ true 0x3965fa0} {0 0 <nil>} 0x1cbe080 <nil> false}, preResource: <nil>, need update

I1031 17:42:38.810474 3754286 cpu_burst.go:538] update pod kube-system/node-shell-44e1d62f-cc8e-468e-a966-e453e58afc02 cpu burst failed, dir kubepods/besteffort/podd7f19c1d-c555-476f-a237-ad4bf3154e2a, updated true, error open /host-cgroup/cpu/kubepods/besteffort/podd7f19c1d-c555-476f-a237-ad4bf3154e2a/cpu.cfs_burst_us: no such file or directory

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Oct 31, 2022

Codecov Report

Base: 68.40% // Head: 68.45% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (87d918f) compared to base (ced5252).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #765      +/-   ##
==========================================
+ Coverage   68.40%   68.45%   +0.05%     
==========================================
  Files         208      208              
  Lines       23956    24008      +52     
==========================================
+ Hits        16387    16435      +48     
- Misses       6426     6428       +2     
- Partials     1143     1145       +2     
Flag Coverage Δ
unittests 68.45% <0.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/koordlet/resmanager/cpu_burst.go 75.23% <0.00%> (-1.06%) ⬇️
pkg/koordlet/metriccache/storage.go 92.62% <0.00%> (-0.86%) ⬇️
pkg/koordlet/metriccache/metric_cache.go 58.32% <0.00%> (+0.59%) ⬆️
pkg/util/system/common_linux.go 64.46% <0.00%> (+0.82%) ⬆️
pkg/koordlet/metricsadvisor/collector.go 11.01% <0.00%> (+1.74%) ⬆️
...eduler/plugins/coscheduling/controller/podgroup.go 74.37% <0.00%> (+2.01%) ⬆️


@@ -198,6 +198,12 @@ func (b *CPUBurst) start() {
// ignore non-burstable pod, e.g. LSR, BE pods
continue
}
if podMeta.Pod.Status.Phase == corev1.PodSucceeded {
Member:

How about other phases like corev1.PodFailed? We may check if the phase is not Running or Pending.

lucming (Contributor Author):

> How about other phases like corev1.PodFailed? We may check if the phase is not Running or Pending.

Because only successfully exited pods are guaranteed never to be restarted.
[Image: Kubernetes pod lifecycle]

Member:
What if its restart policy is set to Never?

saintube (Member) commented Nov 1, 2022:

PodFailed indicates all containers have terminated, so the koordlet would fail to update any of the container-level cgroups without correct cgroup paths.

lucming (Contributor Author) commented Nov 1, 2022:

In fact, the container cgroup files only exist as long as the container does, so should we only handle running pods?

// These are the valid statuses of pods.
const (
	// PodPending means the pod has been accepted by the system, but one or more of the containers
	// has not been started. This includes time before being bound to a node, as well as time spent
	// pulling images onto the host.
	PodPending PodPhase = "Pending"
	// PodRunning means the pod has been bound to a node and all of the containers have been started.
	// At least one container is still running or is in the process of being restarted.
	PodRunning PodPhase = "Running"
	// PodSucceeded means that all containers in the pod have voluntarily terminated
	// with a container exit code of 0, and the system is not going to restart any of these containers.
	PodSucceeded PodPhase = "Succeeded"
	// PodFailed means that all containers in the pod have terminated, and at least one container has
	// terminated in a failure (exited with a non-zero exit code or was stopped by the system).
	PodFailed PodPhase = "Failed"
	// PodUnknown means that for some reason the state of the pod could not be obtained, typically due
	// to an error in communicating with the host of the pod.
	// Deprecated in v1.21: It isn't being set since 2015 (74da3b14b0c0f658b3bb8d2def5094686d0e9095)
	PodUnknown PodPhase = "Unknown"
)

saintube (Member) commented Nov 1, 2022:

The comments indicate that Pending pods may already have some containers running, so I still recommend checking for both PodPending and PodRunning — e.g. a pod with some of its init containers started, or a pod with some regular containers already terminated.

lucming (Contributor Author):

Okay, I'll change the code then.
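The check the reviewers converged on — process a pod's cgroups only while it is Pending or Running — can be sketched as follows. PodPhase and shouldProcess are local stand-ins for corev1.PodPhase and the koordlet logic, used here only so the example is self-contained:

```go
package main

import "fmt"

// PodPhase mirrors corev1.PodPhase for this self-contained sketch.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// shouldProcess reports whether cgroup updates still make sense for a pod:
// only Pending pods (some containers may already be running) and Running
// pods can have live cgroup files; Succeeded and Failed pods have had
// their cgroup directories removed.
func shouldProcess(phase PodPhase) bool {
	return phase == PodPending || phase == PodRunning
}

func main() {
	for _, p := range []PodPhase{PodPending, PodRunning, PodSucceeded, PodFailed} {
		fmt.Printf("%s: %v\n", p, shouldProcess(p))
	}
	// Pending: true
	// Running: true
	// Succeeded: false
	// Failed: false
}
```

In the actual patch this would replace the earlier `== corev1.PodSucceeded` check, turning the guard from "skip succeeded pods" into "process only pending and running pods".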

@lucming lucming force-pushed the code_collation13 branch 2 times, most recently from 08de4dd to ec0e0fb Compare November 1, 2022 09:40
Signed-off-by: lucming <2876757716@qq.com>
koordinator-bot: New changes are detected. LGTM label has been removed.

hormes (Member) commented Nov 1, 2022:

/approve

koordinator-bot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hormes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hormes hormes added the lgtm label Nov 1, 2022
@koordinator-bot koordinator-bot bot merged commit a8e1fbc into koordinator-sh:main Nov 1, 2022
4 participants