Intermittent yet consistent nil pointer issue in /webhook/admission/http.go:84 #1148
/cc @DirectXMan12
After some investigation, it looks like the cause is a race between start and register. Running under the race detector points out the race conditions. I can reproduce pretty reliably without #1155, and cannot reproduce with it.
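For anyone trying to reproduce: the race detector is enabled with Go's `-race` flag (`go test -race ./...` or `go run -race .`). Below is a minimal, self-contained reduction of the same shape of bug; it's an illustration only, not controller-runtime's actual code. One goroutine starts "serving" from a handler map while the main goroutine is still registering into it:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// handlers plays the role of the webhook server's routing table.
var handlers = map[string]http.HandlerFunc{}

func main() {
	// "Start": a goroutine begins consulting the handler map.
	go func() {
		for {
			if h, ok := handlers["/validate"]; ok {
				_ = h
				return
			}
		}
	}()

	// "Register": the main goroutine mutates the map after serving has
	// already begun. `go run -race .` reports the unsynchronized access
	// on every run, even runs that don't happen to crash.
	handlers["/validate"] = func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}

	time.Sleep(100 * time.Millisecond) // give the detector time to report
}
```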
You probably saw the initial log panic from "InjectLog" not being called yet due to the race between start & register.
Thanks @DirectXMan12! We can work around that pretty easily now that we know what it is.
According to kubernetes-sigs/controller-runtime#1148, there's a race condition between manager start and webhook registration. This commit moves webhook setup before manager start. Tested on a GKE cluster: the nil pointer error was not detected in 20 runs.
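For others hitting this, the safe ordering looks roughly like the following. This is a sketch against the v0.6.x-era controller-runtime API (the `/validate` path and `allowAll` handler are placeholders, not HNC's actual code):

```go
package main

import (
	"context"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// allowAll is a placeholder admission handler for the sketch.
func allowAll(ctx context.Context, req admission.Request) admission.Response {
	return admission.Allowed("")
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// Register every webhook while the manager is still idle; once Start
	// is called, registration races with server startup as described in
	// this issue.
	mgr.GetWebhookServer().Register(
		"/validate",
		&webhook.Admission{Handler: admission.HandlerFunc(allowAll)},
	)

	// Hand control to the manager only after all webhooks are in place.
	// (In v0.6.x Start takes a stop channel; newer releases take a
	// context.Context instead.)
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```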
It turns out that we can't work around this - our solution was to register the webhooks before calling `Start()`. So we have the following ordering dependencies:
We'd previously solved constraint #2 by registering the webhook before calling `Start()`. While I'm not terribly happy about it, I'm thinking that our next best option is to add an N-second sleep to our startup routine to reduce the odds of this happening (a sketch of what I mean is below). Our stable version of HNC (v0.5) doesn't seem to suffer from this problem even though nothing's changed structurally except (we think) how long the various startup operations take. I'll examine that option next.
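For concreteness, a sketch of that stopgap; the placement of the sleep and the `/validate-hnc` path are assumptions, since the thread doesn't show HNC's actual startup code (v0.6.x-era signatures):

```go
package sketch

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

// startThenRegister sketches the stopgap: the ordering constraints force
// Start to come first, so sleep N seconds and then register, hoping the
// race window has closed. (Per the next comment, delays of 8s and even
// 16s still crashed, so this is not a robust fix.)
func startThenRegister(mgr manager.Manager, stop <-chan struct{}, hook *webhook.Admission, delay time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- mgr.Start(stop) }()

	time.Sleep(delay) // the N-second stopgap
	mgr.GetWebhookServer().Register("/validate-hnc", hook)

	return <-errCh
}
```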
Hmm, there certainly seems to be a correlation between "adding delay" and "it crashes less often," but after 16s of delay I'm still getting crashes! Here's the number of successful starts I observed before my first crash for a variety of delays:
This doesn't give me a lot of confidence that adding delay is a robust solution. I'm mystified as to what kind of race condition would manifest itself more frequently after 8s than after 16s. I didn't observe any log messages being printed during the sleep. Just to be clear, here's what the code is currently doing:
I'm not even clear on what's causing the race condition we're trying to avoid, tbh.
@DirectXMan12 I pulled your fix from #1155, cherry-picked it into 0.6.3, and imported it into HNC (we were previously using 0.6.1 and jumping straight to
awesome
We got a webhook error intermittently but consistently, about 1 out of 7 deploys. The webhook error we got is:

Here's what the log says (the 8th line specifically):

The line `/webhook/admission/http.go:84` is `reviewResponse = wh.Handle(r.Context(), req)`.

Since we have `controller-runtime` in the `/vendor` directory, I tried a `wh.log.Info("xxx")` at the top of `func (wh *Webhook) ServeHTTP()`. When the intermittent webhook error happened, the logs pointed to that first log line. After adding more `wh.log`s, the intermittent webhook error logs pointed to `reviewResponse = wh.Handle(r.Context(), req)`.

So I guess there's a race condition that could cause `wh` to be nil? (A sketch of this suspicion appears at the end of this report.) To provide more info, we have some validating webhooks and CRD conversion webhooks.

Also, I noticed there's a TODO in the code that I'm not sure is relevant:
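Separately from that TODO: to make the nil-`wh` suspicion concrete, here is a self-contained sketch of the suspected failure mode using stand-in types. This is an assumption drawn from the discussion in this thread, not controller-runtime's actual code: a handler whose logger is only set ("injected") at server start holds a nil interface if registration loses the race, and the first use panics.

```go
package main

// Stand-in types only; an assumption about the failure mode, not
// controller-runtime's code. If the webhook's logger is an interface
// that is injected when the server starts, and registration loses the
// race to Start, the field is still nil and the first admission request
// panics with a nil pointer dereference, matching the trace at
// http.go:84.

type logger interface{ Info(msg string) }

type webhookStub struct {
	log logger // injected at server start; nil if Register raced past it
}

func (wh *webhookStub) serve() {
	wh.log.Info("got admission request") // panics when wh.log is nil
}

func main() {
	wh := &webhookStub{} // injection never ran for this handler
	wh.serve()           // runtime error: invalid memory address or nil pointer dereference
}
```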