Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Other Pod(s) Status Clean up & Pod(s) stuck in Terminating state #89

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

reddymh
Copy link

@reddymh reddymh commented Dec 12, 2022

Added the below changes:

  1. Pod(s) which are stuck in Terminating state and requires graceful delete based on age/time ( experimental change and will discuss about controlling thru flag)
  2. Pod(s) which are in Error/ContainerStatusUnknown/OOMKilled/Terminated/Completed(Sometimes running pod changes to completed due to node re-creation/preemptive nodes ) based on age/time.
  3. Other states like OutOfpods , OutOfcpus , Terminated etc

Issues #88 :

#88

@@ -24,13 +24,39 @@ func podRelatedToCronJob(pod *corev1.Pod, jobStore cache.Store) bool {
return false
}

func shouldDeletePod(pod *corev1.Pod, orphaned, pending, evicted, successful, failed time.Duration) bool {
func shouldDeleteTerminatingPod(pod *corev1.Pod, orphaned, pending, evicted, terminating, successful, failed time.Duration) bool {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to pass orphaned, pending, evicted,successful, failed durations if you only use terminating?

if !podFinishTime.IsZero() {
age := time.Since(podFinishTime)
if terminating > 0 && age >= terminating {
log.Println("Pod(s) Which Are In Terminating State")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.Printf instead of 3 loglines

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwolf for troubleshooting I have added these logs and I thought I have removed those lines.will update the same.

// evicted pods, those with or without owner references, but in Evicted state
// - uses c.deleteEvictedAfter, this one is tricky, because there is no timestamp of eviction.
// So, basically it will be removed as soon as discovered
if pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "Evicted" && evicted > 0 {
return true
}
if pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "OutOfpods" && evicted > 0 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comment about outOfpods and outOfcpu reason or a link to the docs describing it's behavior.

@@ -155,7 +156,7 @@ func TestKleaner_DeletePod(t *testing.T) {
}
for name, tc := range testCases {
t.Run(name, func(t *testing.T) {
result := shouldDeletePod(tc.podSpec, tc.orphaned, tc.pending, tc.evicted, tc.successful, tc.failed)
result := shouldDeletePod(tc.podSpec, tc.orphaned, tc.pending, tc.evicted, tc.terminated, tc.successful, tc.failed)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add test scenario for every case that you're adding

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwolf yes. I will be adding the test cases for each scenario and pushing one by one change for this PR

func shouldDeletePod(pod *corev1.Pod, orphaned, pending, evicted, successful, failed time.Duration) bool {
func shouldDeleteTerminatingPod(pod *corev1.Pod, orphaned, pending, evicted, terminating, successful, failed time.Duration) bool {
// terminating pods which got hanged, those with or without owner references, but in Evicted state
// - uses c.deleteEvictedAfter, this one is tricky, because there is no timestamp of eviction.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this comment is just a copy-paste from the shouldDeletePod, cause it doesn't make any sense


// In Case If Pod(s) is Stuck in Terminating state just in case to delete with force
// Not goood way to fid the root cause of the issue why pod is stuck in terminating state
func (c *Kleaner) DeletePodWithForce(pod *corev1.Pod) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed in the issue I'd prefere not to have force-deletion in the codebase

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwolf yes but we can keep by enabling the flag. it will be useful when there is an issue calico and workload on this node will be blocked due to this issue so to avoid those pod(s) which are stuck in terminating state until the real issue is fixed

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, the issue should be solved by node draining, not force-deletion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants