
Intermittent yet consistent nil pointer issue in /webhook/admission/http.go:84 #1148

Closed · yiqigao217 opened this issue Sep 1, 2020 · 8 comments · Fixed by #1155

@yiqigao217

We got a webhook error intermittently but consistently, about 1 out of 7 deploys. The webhook error we got is:

Error from server (InternalError): Internal error occurred: failed calling webhook "namespaces.hnc.x-k8s.io": Post https://hnc-webhook-service.hnc-system.svc:443/validate-v1-namespace?timeout=30s: stream error: stream ID 13; INTERNAL_ERROR

Here's what the log says (the 8th line specifically):

2020-09-01T14:54:21.716198833Z stderr F 2020/09/01 14:54:21 http2: panic serving 10.244.0.1:48384: runtime error: invalid memory address or nil pointer dereference
2020-09-01T14:54:21.716224783Z stderr F goroutine 232 [running]:
2020-09-01T14:54:21.716228967Z stderr F net/http.(*http2serverConn).runHandler.func1(0xc000f4e020, 0xc00075ff8e, 0xc0006ae900)
2020-09-01T14:54:21.716232Z stderr F    /usr/local/go/src/net/http/h2_bundle.go:5713 +0x16b
2020-09-01T14:54:21.716234692Z stderr F panic(0x1791fa0, 0x29da610)
2020-09-01T14:54:21.716237649Z stderr F         /usr/local/go/src/runtime/panic.go:969 +0x166
2020-09-01T14:54:21.716239964Z stderr F sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP(0xc000ed9680, 0x7fbd452c1b18, 0xc000cedd80, 0xc00047e500)
2020-09-01T14:54:21.716242278Z stderr F         /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:84 +0x8a8
2020-09-01T14:54:21.716244784Z stderr F github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1(0x7fbd452c1b18, 0xc000cedd80, 0xc00047e500)
2020-09-01T14:54:21.716247116Z stderr F         /workspace/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:40 +0xab
2020-09-01T14:54:21.716249307Z stderr F net/http.HandlerFunc.ServeHTTP(0xc000ed98c0, 0x7fbd452c1b18, 0xc000cedd80, 0xc00047e500)
2020-09-01T14:54:21.716251671Z stderr F         /usr/local/go/src/net/http/server.go:2012 +0x44
2020-09-01T14:54:21.716253875Z stderr F github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.716256032Z stderr F         /workspace/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:100 +0xda
2020-09-01T14:54:21.716258434Z stderr F net/http.HandlerFunc.ServeHTTP(0xc000ed9a10, 0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.716260762Z stderr F         /usr/local/go/src/net/http/server.go:2012 +0x44
2020-09-01T14:54:21.716272844Z stderr F github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2(0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.716275213Z stderr F         /workspace/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:76 +0xb2
2020-09-01T14:54:21.716278353Z stderr F net/http.HandlerFunc.ServeHTTP(0xc000ed9b00, 0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.716281725Z stderr F         /usr/local/go/src/net/http/server.go:2012 +0x44
2020-09-01T14:54:21.716285187Z stderr F net/http.(*ServeMux).ServeHTTP(0xc000613900, 0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.716288628Z stderr F         /usr/local/go/src/net/http/server.go:2387 +0x1a5
2020-09-01T14:54:21.71629225Z stderr F net/http.serverHandler.ServeHTTP(0xc000d620e0, 0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.716294532Z stderr F         /usr/local/go/src/net/http/server.go:2807 +0xa3
2020-09-01T14:54:21.716296648Z stderr F net/http.initALPNRequest.ServeHTTP(0x1cfe640, 0xc000f00090, 0xc000b22a80, 0xc000d620e0, 0x1cf9fc0, 0xc000f4e020, 0xc00047e500)
2020-09-01T14:54:21.71629879Z stderr F  /usr/local/go/src/net/http/server.go:3381 +0x8d
2020-09-01T14:54:21.716300936Z stderr F net/http.(*http2serverConn).runHandler(0xc0006ae900, 0xc000f4e020, 0xc00047e500, 0xc000634220)
2020-09-01T14:54:21.716303197Z stderr F         /usr/local/go/src/net/http/h2_bundle.go:5720 +0x8b
2020-09-01T14:54:21.716305467Z stderr F created by net/http.(*http2serverConn).processHeaders
2020-09-01T14:54:21.716307634Z stderr F         /usr/local/go/src/net/http/h2_bundle.go:5454 +0x4e1

The line /webhook/admission/http.go:84 is

wh.log.V(1).Info("received request", "UID", req.UID, "kind", req.Kind, "resource", req.Resource)

Since we have controller-runtime in the /vendor directory, I tried

  1. putting wh.log.Info("xxx") at the top of func (wh *Webhook) ServeHTTP(). When the intermittent webhook error happened, the panic in the logs pointed to that first log line.
  2. omitting the wh.log calls entirely; the panic then pointed to reviewResponse = wh.Handle(r.Context(), req).

So I guess there's a race condition that could cause wh (or one of its fields) to be nil? For more context, we have some validating webhooks and CRD conversion webhooks.

I also noticed a TODO in the code that I'm not sure is relevant:

	// TODO: add panic-recovery for Handle
	reviewResponse = wh.Handle(r.Context(), req)
	wh.writeResponse(w, reviewResponse)
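
For what it's worth, here is a rough, self-contained sketch of the kind of panic recovery that TODO seems to hint at (my own toy illustration with a hypothetical withRecovery wrapper, not the actual controller-runtime code): a middleware that turns a panicking handler into an HTTP 500 instead of letting the panic tear down the HTTP/2 stream.

package main

import (
	"fmt"
	"net/http"
)

// withRecovery wraps a handler so that a panic is converted into a 500
// response rather than propagating up into the HTTP/2 server machinery.
func withRecovery(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				http.Error(w, fmt.Sprintf("internal error in webhook handler: %v", rec),
					http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Stands in for the nil pointer dereference seen in ServeHTTP above.
	panicky := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic("simulated nil pointer dereference")
	})
	http.Handle("/validate-v1-namespace", withRecovery(panicky))
	_ = http.ListenAndServe(":8443", nil)
}
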
@yiqigao217 (Author)

/cc @DirectXMan12
/cc @adrianludwin
/cc @GinnyJI

@DirectXMan12 (Contributor)

After some investigation, it looks like the cause is a race between start and register. Running under the race detector points out the race conditions.

I can reproduce pretty reliably without #1155, and cannot reproduce with it.
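
To give a feel for the shape of that race, here is a toy sketch of mine (not controller-runtime's actual internals): one goroutine keeps reading shared handler state while another goroutine is still writing it, with nothing ordering the two, and the race detector flags it.

package main

import "time"

func main() {
	registered := false // shared handler state, no synchronization

	// "start": the serving goroutine keeps reading the shared state.
	go func() {
		for {
			if registered {
				return
			}
		}
	}()

	// "register": the main goroutine writes while the reader is running.
	registered = true

	// Let the two accesses overlap; `go run -race .` reports the data race.
	time.Sleep(100 * time.Millisecond)
}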

@DirectXMan12 (Contributor)

You probably saw the initial log panic from "InjectLog" not being called yet due to the race between start & register.
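
To illustrate, a minimal self-contained sketch (toy types of mine, not the real controller-runtime ones) of why a method call on the not-yet-injected logger field panics exactly like the trace above:

package main

import "fmt"

// logger and webhookStub mimic the relevant shape of admission.Webhook: the
// log field stays nil until something analogous to logger injection runs.
type logger interface {
	Info(msg string, keysAndValues ...interface{})
}

type webhookStub struct {
	log logger
}

func (wh *webhookStub) serve() {
	// Calling a method on a nil interface value: "runtime error: invalid
	// memory address or nil pointer dereference", matching http.go:84 above.
	wh.log.Info("received request")
}

func main() {
	wh := &webhookStub{} // a request arrives before injection has happened
	defer func() {
		fmt.Println("recovered:", recover())
	}()
	wh.serve()
}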

@adrianludwin

Thanks @DirectXMan12 ! We can work around that pretty easily now that we know what it is.

yiqigao217 added a commit to yiqigao217/multi-tenancy that referenced this issue Sep 9, 2020
According to
kubernetes-sigs/controller-runtime#1148,
there's a race condition between manager start and webhook register.
This commit moved webhook setup before manager start.

Tested on GKE cluster. The nil pointer error was not detected for 20
times.
@adrianludwin

It turns out that we can't work around this: our solution was to register the webhooks before calling Start, but this only works if all the secrets for the webhooks already exist. That was true on the clusters we tested, but it isn't true if you're starting HNC for the first time on a new cluster.

So we have the following ordering dependencies:

  1. We need to start the cert-controller before we register any webhooks (this is what we've just learned)
  2. We can't register the webhooks in parallel with calling Start (this was what we'd previously learned)

We'd previously solved constraint #2 by registering the webhook before calling Start but we now know that violates constraint #1.
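
For context, that earlier workaround looked roughly like this (a sketch with a hypothetical namespaceValidator handler, not our actual HNC code): register everything before Start, which satisfies constraint #2 but assumes the serving certs already exist.

package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// namespaceValidator is a hypothetical stand-in for our real handler.
type namespaceValidator struct{}

func (v *namespaceValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	return admission.Allowed("")
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Register *before* Start so the webhook has been wired up before any
	// traffic can arrive (no start/register race)...
	mgr.GetWebhookServer().Register("/validate-v1-namespace",
		&webhook.Admission{Handler: &namespaceValidator{}})

	// ...but as noted above, this only works if the webhook certs/secrets
	// already exist, which isn't the case on a brand-new cluster.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}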

While I'm not terribly happy about it, I'm thinking that our next best option is to add an N-second sleep to our startup routine to reduce the odds of this happening. Our stable version of HNC (v0.5) doesn't seem to suffer from this problem, even though nothing's changed structurally except (we think) how long the various startup operations take. I'll examine that option next.

@adrianludwin

Hmm, there certainly seems to be a correlation between "adding delay" and "it crashes less often", but after 16s of delay I'm still getting crashes! Here is the number of successful starts I observed before my first crash, for a variety of delays:

  • 1s: 1
  • 2s: 1
  • 4s: 2
  • 8s: 5
  • 16s: 11

This doesn't give me a lot of confidence that adding delay is a robust solution. I'm mystified as to what kind of race condition would manifest itself more frequently after 8s than after 16s. I didn't observe any log messages being printed out during the sleep.

Just to be clear, what the code currently does (sketched in code after this list) is:

  • Start one controller (reconciler), a stripped-down replacement of cert-manager.
  • Start a goroutine to register all webhooks and the remaining reconcilers, which immediately blocks until the cert-manager replacement says it's finished.
  • Start the controller-runtime manager; this is the "end" of the main goroutine.
  • Once the goroutine unblocks (after about 2s, typically), sleep for N seconds.
  • Register the validating and conversion webhooks, then start the reconcilers.
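
In code, that ordering looks roughly like the following paraphrase (made-up names like certsReady, registerWebhooks and startReconcilers; not our actual source):

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func registerWebhooks(mgr ctrl.Manager) { /* GetWebhookServer().Register(...) calls */ }
func startReconcilers(mgr ctrl.Manager) { /* SetupWithManager(...) calls */ }

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	certsReady := make(chan struct{})
	go func() {
		// Stands in for the stripped-down cert-manager replacement; the real
		// one closes the channel once the certs are in place (about 2s).
		close(certsReady)
	}()

	delay := 8 * time.Second // the N-second sleep being discussed
	go func() {
		<-certsReady          // block until the cert replacement is finished
		time.Sleep(delay)     // sleep for N seconds
		registerWebhooks(mgr) // nothing orders this with mgr.Start below
		startReconcilers(mgr)
	}()

	// The main goroutine "ends" in Start, so the registration above can still
	// race with the already-running webhook server.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}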

I'm not even clear on what's causing the race condition we're trying to avoid, tbh.

@adrianludwin

@DirectXMan12 I pulled your fix from #1155, cherry-picked it into 0.6.3, and imported it into HNC (we were previously using 0.6.1, and jumping straight to master caused dependency issues). It worked perfectly 20/20 times; without this fix, it was failing after 1-2 attempts (see my last comment). So I think the fix looks good.

@DirectXMan12 (Contributor)

awesome
