FIX: fix the bug -> when a check pod is deleted while pending, the main process blocks && format project #437
@jonnydawg |
@calmkart seems like we can't reproduce the bug? I tried applying this deployment configuration
Which is expected since it can't find the checker pod that was deleted. However, once the check hits its timeout interval, another check gets provisioned and completes successfully. Have you tried waiting until the timeout interval instead of the run interval for the check to run again? Also, let me know if there's something else I'm missing in order to reproduce this! |
I'm not sure if I'm able to reproduce this bug yet -- still checking it out. I'm getting a lot of
|
How to trigger this bug: video. This log is very important
but the shutdown never happens. The reason why |
I've been able to reproduce the bug, thanks for the video! I've also been testing your changes and they look good so far -- I'm going to continue some testing for a bit. |
Looks good! Thanks for doing this!
Thank you @calmkart - this looks great! I appreciate the fix and thorough explanation of the issue. This could have gone on awhile and been hard to track down but you nailed it quickly.
For issue #436.
This fixes the bug where the main process blocks.
The detailed trigger sequence is as follows.
When the watch on the khcheck CR sees any change, we call k.RestartChecks():
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/cmd/kuberhealthy/kuberhealthy.go#L183-L199
However, k.StopChecks() waits for c.Shutdown() to return:
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/cmd/kuberhealthy/kuberhealthy.go#L115-L140
c.Shutdown() in turn triggers ext.shutdownCTXFunc() and then waits on ext.wg.Wait():
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L1216-L1240
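To make the handshake above concrete, here is a minimal sketch of a cancel-then-wait shutdown. The checker type and its methods are hypothetical stand-ins, not the real Checker; the sketch only assumes that shutdown cancels a context and then waits on a sync.WaitGroup, as described above.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// checker is a hypothetical stand-in for the external Checker. Only the
// pieces needed to show the shutdown handshake are modeled.
type checker struct {
	wg     sync.WaitGroup
	cancel context.CancelFunc
}

// start launches one worker that exits as soon as the context is cancelled.
func (c *checker) start() {
	ctx, cancel := context.WithCancel(context.Background())
	c.cancel = cancel
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		<-ctx.Done()
	}()
}

// shutdown cancels the context, then blocks until every worker calls Done.
func (c *checker) shutdown() {
	c.cancel()
	c.wg.Wait()
}

// shutdownWithin reports whether shutdown returned before the timeout.
func shutdownWithin(c *checker, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		c.shutdown()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// cleanShutdown starts a well-behaved worker and shuts it down.
func cleanShutdown() bool {
	c := &checker{}
	c.start()
	return shutdownWithin(c, time.Second)
}

func main() {
	fmt.Println(cleanShutdown()) // true: the worker honors cancellation
}
```

The wait only returns if every goroutine that called Add(1) eventually calls Done(), so a single worker that never exits is enough to hang the whole shutdown.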
When a check runs, func (ext *Checker) RunOnce() error {} calls waitForPodStart:
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L619-L631
ext.waitForPodStart runs ext.wg.Add(1):
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L938-L1015
However, in the waitForPodStart() function, if the pod is deleted before
p.Status.Phase == apiv1.PodRunning || p.Status.Phase == apiv1.PodFailed || p.Status.Phase == apiv1.PodSucceeded
becomes true, the for e := range watcher.ResultChan() loop blocks. As a result, k.StopChecks() blocks, and so does k.RestartChecks():
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/cmd/kuberhealthy/kuberhealthy.go#L183-L199
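The blocking chain can be reproduced in isolation. This is a simplified model, not the real code: events are plain strings instead of watch.Event, and the function names are hypothetical. It assumes the loop ignores everything except the phases it is waiting for, so a deletion that arrives first leaves it blocked on the next receive.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitForRunning ranges over events and returns only when it sees "Running".
// It has no case for a deletion and no way out if the channel goes quiet.
func waitForRunning(events <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for e := range events {
		if e == "Running" {
			return
		}
	}
}

// waitReturnsWithin reports whether wg.Wait() returns before the timeout.
func waitReturnsWithin(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// waitReturnsAfterDelete feeds the loop a pod that is deleted while pending
// and reports whether the WaitGroup was ever released.
func waitReturnsAfterDelete() bool {
	events := make(chan string, 2)
	var wg sync.WaitGroup
	wg.Add(1)
	go waitForRunning(events, &wg)
	events <- "Pending"
	events <- "Deleted" // the pod is gone, but the loop keeps waiting
	return waitReturnsWithin(&wg, 200*time.Millisecond)
}

func main() {
	fmt.Println(waitReturnsAfterDelete()) // false: wg.Wait() never returns
}
```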
We can watch for the pod's delete event here and return an error.
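A rough sketch of that idea follows. It is not the actual diff in this PR: podEvent, eventType, errPodDeleted, and collect are hypothetical local types and helpers that stand in for watch.Event and the pod phase, so the example runs without the Kubernetes client libraries.

```go
package main

import (
	"errors"
	"fmt"
)

// eventType and podEvent are hypothetical stand-ins for watch.EventType and
// watch.Event carrying a pod.
type eventType string

const (
	modified eventType = "MODIFIED"
	deleted  eventType = "DELETED"
)

type podEvent struct {
	Type  eventType
	Phase string
}

// errPodDeleted is a hypothetical sentinel error for this sketch.
var errPodDeleted = errors.New("checker pod was deleted before it started")

// waitForPodStart returns nil once the pod reaches a started or finished
// phase, and returns an error as soon as a delete event is seen.
func waitForPodStart(events <-chan podEvent) error {
	for e := range events {
		if e.Type == deleted {
			return errPodDeleted
		}
		switch e.Phase {
		case "Running", "Failed", "Succeeded":
			return nil
		}
	}
	return errors.New("watch channel closed before the pod started")
}

// collect feeds a fixed sequence of events through waitForPodStart.
func collect(evs ...podEvent) error {
	ch := make(chan podEvent, len(evs))
	for _, e := range evs {
		ch <- e
	}
	close(ch)
	return waitForPodStart(ch)
}

func main() {
	err := collect(
		podEvent{Type: modified, Phase: "Pending"},
		podEvent{Type: deleted, Phase: "Pending"},
	)
	fmt.Println(err) // checker pod was deleted before it started
}
```

Because the function now returns on deletion, the deferred Done() in the caller can run and the pending wg.Wait() is released.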
I also saw a method
func (ext *Checker) watchForCheckerPodShutdown(shutdownEventNotifyC chan struct{}, ctx context.Context)
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L597-L599
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L374-L415
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L460-L500
but
func (ext *Checker) waitForDeletedEvent(eventsIn <-chan watch.Event, sawRemovalChan chan struct{}, stoppedChan chan struct{})
does nothing:
https://github.com/Comcast/kuberhealthy/blob/3309e2b740e635f00bcd257c5ba0be5209ddaed2/pkg/checks/external/main.go#L469-L494
When a check pod is created, its status is pending first, and waitForDeletedEvent returns as soon as the status becomes 'pending', because that change arrives as a watch.Modified event. The case watch.Deleted: branch therefore never runs, because a pod can't be deleted before it is pending. I haven't carefully checked whether this method has any other effect, so I left its logic as it is and only added a watch for the delete event in waitForPodStart().
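To illustrate why the delete case is never reached, here is a small model of the behavior described above. It is only a sketch of the control flow as I understand it, not the real body of waitForDeletedEvent; both function names are hypothetical, and events are reduced to their type.

```go
package main

import "fmt"

type eventType string

const (
	modified eventType = "MODIFIED"
	deleted  eventType = "DELETED"
)

// sawDeletionReturnEarly models a watcher that returns after handling its
// first event, whatever that event is.
func sawDeletionReturnEarly(events []eventType) bool {
	for _, e := range events {
		switch e {
		case deleted:
			return true
		default:
			return false // the first Modified event ends the watch
		}
	}
	return false
}

// sawDeletionKeepWatching models a watcher that skips other events and
// keeps going until it sees a deletion.
func sawDeletionKeepWatching(events []eventType) bool {
	for _, e := range events {
		if e == deleted {
			return true
		}
	}
	return false
}

func main() {
	// A pod always goes pending (Modified) before it can be deleted.
	seq := []eventType{modified, deleted}
	fmt.Println(sawDeletionReturnEarly(seq), sawDeletionKeepWatching(seq)) // false true
}
```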