Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic in kube-apiserverver #117363

Closed
soltysh opened this issue Apr 14, 2023 · 13 comments · Fixed by kubernetes/kube-openapi#387 or #117705
Closed

panic in kube-apiserverver #117363

soltysh opened this issue Apr 14, 2023 · 13 comments · Fixed by kubernetes/kube-openapi#387 or #117705
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@soltysh
Copy link
Contributor

soltysh commented Apr 14, 2023

What happened?

While bringing in k8s 1.27 into openshift I've noticed rare occurrences of the following panic in kube-apiserver:

E0414 12:05:22.601877      16 runtime.go:77] Observed a panic: runtime error: index out of range [0] with length 0
goroutine 165589 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
	vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:110 +0x9c
panic({0x4c8e9c0, 0xc01e053968})
	/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.3.linux.amd64/src/runtime/panic.go:884 +0x213
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1.1()
	vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:84 +0x2a
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1()
	vendor/k8s.io/apiserver/pkg/endpoints/filters/audit.go:93 +0x25c
panic({0x4c8e9c0, 0xc01e053968})
	/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.3.linux.amd64/src/runtime/panic.go:884 +0x213
github.com/google/gnostic/openapiv2.ParseDocument({0x0?, 0x290?, 0x4acc6e0?})
	vendor/github.com/google/gnostic/openapiv2/document.go:29 +0xc5
k8s.io/kube-openapi/pkg/handler.ToProtoBinary({0x0?, 0x0?, 0xc0415d37e0?})
	vendor/k8s.io/kube-openapi/pkg/handler/handler.go:121 +0x1e
k8s.io/kube-openapi/pkg/handler.NewOpenAPIServiceLazy.func2({{{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x0, 0x0}})
	vendor/k8s.io/kube-openapi/pkg/handler/handler.go:91 +0x89
k8s.io/kube-openapi/pkg/cached.NewTransformer[...].func1()
	vendor/k8s.io/kube-openapi/pkg/cached/cache.go:196 +0x135
k8s.io/kube-openapi/pkg/cached.(*merger[...]).Get(0x593daa0)
	vendor/k8s.io/kube-openapi/pkg/cached/cache.go:175 +0xd1
k8s.io/kube-openapi/pkg/handler.(*OpenAPIService).RegisterOpenAPIVersionedService.func1({0x59198a0, 0xc02182cee0}, 0xc013142300)
	vendor/k8s.io/kube-openapi/pkg/handler/handler.go:166 +0x3b1
net/http.HandlerFunc.ServeHTTP(0x0?, {0x59198a0?, 0xc02182cee0?}, 0x4?)
	/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.20.3.linux.amd64/src/net/http/server.go:2122 +0x2f
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x591a8c0, 0xc00b3a0fe0}, 0x1?)
	vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e

The whole panic can be viewed here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-kubernetes-1524-ci-4.14-e2e-gcp-ovn-upgrade/1646803175610519552/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-kube-apiserver_kube-apiserver-ci-op-ndiw873z-dd74f-m7x98-master-2_kube-apiserver.log

What did you expect to happen?

No panic

How can we reproduce it (as minimally and precisely as possible)?

package main

import (
	"fmt"
	openapi_v2 "github.com/google/gnostic/openapiv2"
)

func main() {
	json := make([]byte, 0)
	document, err := openapi_v2.ParseDocument(json)
	if err != nil {
	 	fmt.Println(err)
	 	return
	}
	fmt.Println(document)
	fmt.Printf("%q\n", json)
}

will fail in similar way as the panic above. The problematic bit seems to be here:
https://github.com/google/gnostic/blob/ade94e0d08cb9c60272a311608cd5dabd30d1813/openapiv2/document.go#L29
we're trying to read Content from a valid, empty document, which results in the panic.

Kubernetes version

v1.27.0

Cloud provider

AWS, GCP

@soltysh soltysh added the kind/bug Categorizes issue or PR as related to a bug. label Apr 14, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 14, 2023
@soltysh
Copy link
Contributor Author

soltysh commented Apr 14, 2023

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 14, 2023
@soltysh
Copy link
Contributor Author

soltysh commented Apr 14, 2023

I'm still trying to confirm that this is indeed caused by empty json passed to https://github.com/kubernetes/kube-openapi/blob/f5883ff37f0c685e2c0ad657c26c52baba49be6d/pkg/handler/handler.go#L91 but I can't force the CI to fail just now 😅

@liggitt
Copy link
Member

liggitt commented Apr 14, 2023

/cc @apelisse

@Jefftree
Copy link
Member

/assign

@Jefftree
Copy link
Member

Some preliminary investigation:

  • https://github.com/google/gnostic/blob/ade94e0d08cb9c60272a311608cd5dabd30d1813/openapiv2/document.go#L29 does throw the same error when parsing an empty byte array make([]byte, 0)
  • I'm not exactly sure how this is manifested since the input must go through the json marshalling which would normally return the null byte (different from empty byte array) on an empty input
  • We recently implemented the JSON experiment marshaler and our marshal calls are changed from json.Marshal(data) to data.MarshalJSON(). It might handle certain edge cases differently and I'm wondering if there's a specific configuration that causes an empty byte array to be outputted

@liggitt
Copy link
Member

liggitt commented Apr 14, 2023

We recently implemented the JSON experiment marshaler and our marshal calls are changed from json.Marshal(data) to data.MarshalJSON(). It might handle certain edge cases differently and I'm wondering if there's a specific configuration that causes an empty byte array to be outputted

I really don't think that's the issue

I think it's more likely something in the jsonCache → protoCache chain is passing a cache miss and zero-value result to protoCache, so result.Err is nil, but result.Data is also zero-value, but I'm having a really hard time following the generic code in https://github.com/kubernetes/kube-openapi/blob/f5883ff37f0c685e2c0ad657c26c52baba49be6d/pkg/cached/cache.go#L190

@liggitt
Copy link
Member

liggitt commented Apr 14, 2023

@soltysh, can you add something like this to the CI run to try to figure out the path that leads to empty JSON:

diff --git a/vendor/k8s.io/kube-openapi/pkg/handler/handler.go b/vendor/k8s.io/kube-openapi/pkg/handler/handler.go
index 84e90264653..f76e2815386 100644
--- a/vendor/k8s.io/kube-openapi/pkg/handler/handler.go
+++ b/vendor/k8s.io/kube-openapi/pkg/handler/handler.go
@@ -21,6 +21,7 @@ import (
 	"crypto/sha512"
 	"fmt"
 	"net/http"
+	"runtime/debug"
 	"strconv"
 	"time"
 
@@ -30,6 +31,7 @@ import (
 	openapi_v2 "github.com/google/gnostic/openapiv2"
 	"github.com/google/uuid"
 	"github.com/munnerz/goautoneg"
+
 	klog "k8s.io/klog/v2"
 	"k8s.io/kube-openapi/pkg/builder"
 	"k8s.io/kube-openapi/pkg/cached"
@@ -88,6 +90,12 @@ func NewOpenAPIServiceLazy(swagger cached.Data[*spec.Swagger]) *OpenAPIService {
 		if result.Err != nil {
 			return cached.NewResultErr[timedSpec](result.Err)
 		}
+		if len(result.Data.spec) == 0 {
+			err := fmt.Errorf("unexpectedly empty json")
+			fmt.Println(err)
+			debug.PrintStack()
+			return cached.NewResultErr[timedSpec](err)
+		}
 		proto, err := ToProtoBinary(result.Data.spec)
 		if err != nil {
 			return cached.NewResultErr[timedSpec](err)

@apelisse
Copy link
Member

apelisse commented Apr 14, 2023

The panic we currently have is on the next line, I don't think that we'll add a lot of information since we'll have the same backtrace?

@liggitt
Copy link
Member

liggitt commented Apr 14, 2023

Oh… right. Hrmm. Suggestions for figuring out how a zero length data is propagating here?

@apelisse
Copy link
Member

Looking into it. It might be more useful to get the trace when we generate the json (and see if we actually generate a 0 length json), i.e. just before this line: https://github.com/kubernetes/kube-openapi/blob/master/pkg/handler/handler.go#L85

@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2023
@fedebongio
Copy link
Contributor

Thank you all for the investigation and help!

@liggitt
Copy link
Member

liggitt commented May 1, 2023

reopening until fix is merged into kubernetes - #117705 (and 1.27 cherry-pick)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants