
FIX: fix the bug -> when a check pod is deleted while pending, the main process blocks && format project #437

Merged
merged 1 commit into from Apr 28, 2020

Conversation

calmkart
Contributor

For issue #436.
Fixes the bug where the main process blocks.

The following is a detailed explanation of how the bug is triggered.

In the waitForPodStart() function, if the pod is deleted before p.Status.Phase == apiv1.PodRunning || p.Status.Phase == apiv1.PodFailed || p.Status.Phase == apiv1.PodSucceeded is satisfied, the for e := range watcher.ResultChan() loop blocks forever.

As a result, k.StopChecks() blocks, and so does k.RestartChecks().
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/cmd/kuberhealthy/kuberhealthy.go#L183-L199

We can watch for the pod delete event here and return an error.
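
For illustration, here is a minimal sketch of that idea, not the exact kuberhealthy code (the function name and package here are assumed): handle watch.Deleted inside the wait loop and return an error so callers such as k.StopChecks() are unblocked.

package external

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// waitForPodStartSketch consumes pod events and returns once the pod has
// started (or finished), or an error if the pod is deleted first.
func waitForPodStartSketch(watcher watch.Interface) error {
	defer watcher.Stop()

	for e := range watcher.ResultChan() {
		// New: if the pod is deleted while still Pending, return an error
		// instead of waiting forever for a Running/Failed/Succeeded phase.
		if e.Type == watch.Deleted {
			return errors.New("pod was deleted before it started running")
		}

		p, ok := e.Object.(*apiv1.Pod)
		if !ok {
			continue // not a pod object; keep waiting
		}

		// Original exit condition described above.
		if p.Status.Phase == apiv1.PodRunning ||
			p.Status.Phase == apiv1.PodFailed ||
			p.Status.Phase == apiv1.PodSucceeded {
			return nil
		}
	}
	return errors.New("watch channel closed before the pod started")
}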

I also saw a method, func (ext *Checker) watchForCheckerPodShutdown(shutdownEventNotifyC chan struct{}, ctx context.Context):
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L597-L599
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L374-L415
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L460-L500

but func (ext *Checker) waitForDeletedEvent(eventsIn <-chan watch.Event, sawRemovalChan chan struct{}, stoppedChan chan struct{}) effectively does nothing.
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L469-L494

When a check pod is created, its status is Pending first, and waitForDeletedEvent returns as soon as it sees that Pending status because it arrives as a watch.Modified event, so the case watch.Deleted: branch never runs; a pod cannot be deleted before it has even reached Pending.
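
For illustration, a minimal sketch of the pattern described above (names and signature are simplified, not a copy of the linked code):

package external

import "k8s.io/apimachinery/pkg/watch"

// If the event loop returns on the first event it sees, the watch.Deleted
// branch can never fire in practice: a freshly created pod always emits a
// Modified event (phase Pending) before it could possibly be deleted.
func waitForDeletedEventSketch(eventsIn <-chan watch.Event, sawRemovalChan chan struct{}) {
	for e := range eventsIn {
		switch e.Type {
		case watch.Deleted:
			// The branch we actually care about: signal the removal.
			sawRemovalChan <- struct{}{}
			return
		default:
			// Returning here on any other event (for example the Modified
			// event sent while the pod is still Pending) means the Deleted
			// case above is never reached.
			return
		}
	}
}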

I haven't checked carefully whether this method has any other effects, so I did not modify its logic and left it as it is. The fix only watches for the delete event in waitForPodStart().

@calmkart
Contributor Author

@jonnydawg
PTAL, thanks.
This bug bothered me so much that I worked on it all day.

@joshulyne
Collaborator

joshulyne commented Apr 24, 2020

@calmkart seems like we can't reproduce the bug? I tried applying this deployment configuration https://github.com/Comcast/kuberhealthy/blob/master/deploy/kuberhealthy.yaml and as soon as the deployment check comes up as pending, I delete the deployment checker pod and I see a lot of this in the kuberhealthy logs:

time="2020-04-24T18:37:20Z" level=info msg="4246412f-1e52-433a-889b-49d8b9b672cb kuberhealthy/deployment: [waiting for external checker pod to report in...]"
time="2020-04-24T18:37:20Z" level=info msg="4246412f-1e52-433a-889b-49d8b9b672cb kuberhealthy/deployment: [Last report time was: 0001-01-01 00:00:00 +0000 UTC vs 0001-01-01 00:00:00 +0000 UTC]"
time="2020-04-24T18:37:20Z" level=info msg="4246412f-1e52-433a-889b-49d8b9b672cb kuberhealthy/deployment: [have not yet seen pod update since 0001-01-01 00:00:00 +0000 UTC]"

Which is expected since it can't find the checker pod that was deleted. However, once the check hits its timeout interval, another check gets provisioned and completes successfully. Have you tried waiting until the timeout interval instead of the run interval for the check to run again? Also, let me know if there's something else I'm missing in order to reproduce this!

@jonnydawg
Collaborator

jonnydawg commented Apr 24, 2020

I'm not sure if I'm able to reproduce this bug yet -- still checking it out. I'm getting a lot of

time="2020-04-24T18:12:23Z" level=info msg="dfaf914c-f3f7-4f90-b6a6-b8f53abbee53 kuberhealthy/deployment: [pod state is now: Pending]"
time="2020-04-24T18:12:23Z" level=info msg="dfaf914c-f3f7-4f90-b6a6-b8f53abbee53 kuberhealthy/deployment: [got an event while waiting for pod to start running]"
time="2020-04-24T18:12:23Z" level=info msg="dfaf914c-f3f7-4f90-b6a6-b8f53abbee53 kuberhealthy/deployment: [pod state is now: Pending]"
time="2020-04-24T18:12:29Z" level=debug msg="Starting a watch for khcheck object changes"
time="2020-04-24T18:12:30Z" level=debug msg="khcheck monitor saw an added event"
time="2020-04-24T18:12:30Z" level=debug msg="Change notification received. Scanning for external check changes..."
time="2020-04-24T18:12:30Z" level=debug msg="Scanning khcheck CRD kuberhealthy/deployment for changes since last seen..."
time="2020-04-24T18:13:10Z" level=info msg="khState reaper: starting to run an audit"
time="2020-04-24T18:13:10Z" level=info msg="khState reaper: analyzing 1 khState resources"
time="2020-04-24T18:13:10Z" level=debug msg="khState reaper: analyzing khState deployment in deployment"
time="2020-04-24T18:13:10Z" level=debug msg="khState reaper: deployment == deployment && kuberhealthy == kuberhealthy"
time="2020-04-24T18:13:10Z" level=info msg="khState reaper: deployment in kuberhealthy is still valid"
time="2020-04-24T18:14:10Z" level=info msg="khState reaper: starting to run an audit"
time="2020-04-24T18:14:10Z" level=info msg="khState reaper: analyzing 1 khState resources"
time="2020-04-24T18:14:10Z" level=debug msg="khState reaper: analyzing khState deployment in deployment"
time="2020-04-24T18:14:10Z" level=debug msg="khState reaper: deployment == deployment && kuberhealthy == kuberhealthy"

@calmkart
Contributor Author

calmkart commented Apr 25, 2020

How to trigger this bug:

[video showing the reproduction steps]

This log line is the key indicator:

Waiting for background workers to cleanup...

[screenshot of the log output]

but the shutdown never completes.

https://github.com/Comcast/kuberhealthy/blob/0e4d57a10cefed3e816f630197afaca0178b219a/pkg/checks/external/main.go#L1216-L1240

The reason ext.wg.Wait() blocks is explained in the PR description above.
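
For illustration, a minimal standalone sketch of the hang (names are made up, not kuberhealthy code): the background worker goroutine is stuck waiting for a pod event that never arrives, so wg.Wait() in the shutdown path never returns. In this toy program the Go runtime aborts with a deadlock error because nothing else is running; in kuberhealthy other goroutines stay alive, so the process simply hangs here.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	events := make(chan string) // stands in for watcher.ResultChan()

	wg.Add(1)
	go func() {
		defer wg.Done()
		// Stand-in for the blocked wait loop: keep reading until "running"
		// arrives. If the pod was deleted while Pending, it never does.
		for e := range events {
			if e == "running" {
				return
			}
		}
	}()

	events <- "pending" // the pod reaches Pending, then gets deleted

	fmt.Println("Waiting for background workers to cleanup...")
	wg.Wait() // never returns, so the line below is never printed
	fmt.Println("Done cleaning up background workers")
}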

@joshulyne @jonnydawg

@jonnydawg
Collaborator

I've been able to reproduce the bug, thanks for the video! I've also been testing your changes and they look good so far -- I'm going to continue some testing for a bit.

Collaborator

@jonnydawg jonnydawg left a comment


Looks good! Thanks for doing this!

Collaborator

@integrii integrii left a comment


Thank you @calmkart - this looks great! I appreciate the fix and the thorough explanation of the issue. This could have gone on for a while and been hard to track down, but you nailed it quickly.

@integrii integrii merged commit 7572e86 into kuberhealthy:master Apr 28, 2020