integration tests log many "http: TLS handshake error from 127.0.0.1:55336: EOF" errors #109022
@liggitt: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:
Please see the group list for a listing of the SIGs, working groups, and committees available. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@liggitt: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc
seems to be coming from this poll loop: kubernetes/test/integration/apiserver/admissionwebhook/admission_test.go Lines 681 to 682 in b7c2faf
the client retries the EOF internally
that's... weird... I don't see any reason those would hit TLS EOF errors
diff --git a/test/integration/apiserver/admissionwebhook/admission_test.go b/test/integration/apiserver/admissionwebhook/admission_test.go
index 4c64bdca26f..031e49326c8 100644
--- a/test/integration/apiserver/admissionwebhook/admission_test.go
+++ b/test/integration/apiserver/admissionwebhook/admission_test.go
@@ -275,7 +275,7 @@ func (h *holder) record(version string, phase string, converted bool, request *a
defer h.lock.Unlock()
// this is useful to turn on if items aren't getting recorded and you need to figure out why
- debug := false
+ debug := true
if debug {
h.t.Logf("%s %#v %v", request.Operation, request.Resource, request.SubResource)
}
@@ -733,6 +733,7 @@ func testResourceDelete(c *testContext) {
// wait for the item to be gone
err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
+ fmt.Println("DEBUG testResourceDelete")
obj, err := c.client.Resource(c.gvr).Namespace(obj.GetNamespace()).Get(context.TODO(), obj.GetName(), metav1.GetOptions{})
if apierrors.IsNotFound(err) {
return true, nil
@@ -747,6 +748,7 @@ func testResourceDelete(c *testContext) {
c.t.Error(err)
return
}
+ fmt.Println("DEBUG testResourceDelete FINISH")
admission_test.go:323: recording: admissionwebhook.webhookOptions{version:"v1beta1", phase:"validation", converted:false} = DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"}
admission_test.go:280: DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"}
admission_test.go:323: recording: admissionwebhook.webhookOptions{version:"v1", phase:"validation", converted:true} = DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"}
DEBUG testResourceDelete
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56638: EOF
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56634: EOF
DEBUG testResourceDelete FINISH
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56636: EOF
admission_test.go:280: CREATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"}
who is logging those lines with the EOF error?
http.Server |
I've found the cause, but there are several things I'd like to sort out first
Webhooks have a very interesting and complex setup; let me write it down for reference. Webhooks inside the apiserver use a RESTClient to contact the webhook servers, and this is handled by the client manager:
kubernetes/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go Lines 65 to 66 in dda9bcb
Conversion and validating/mutating webhooks don't use the same client manager, though:
$ grep -r NewClientManager staging/
staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/webhook_converter.go: clientManager, err := webhook.NewClientManager(
staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/generic/webhook.go: cm, err := webhookutil.NewClientManager(
staging/src/k8s.io/apiserver/pkg/util/webhook/client.go:// NewClientManager creates a clientManager.
staging/src/k8s.io/apiserver/pkg/util/webhook/client.go:func NewClientManager(gvs []schema.GroupVersion, addToSchemaFuncs ...func(s *runtime.Scheme) error) (ClientMa
See kubernetes/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go Lines 122 to 124 in dda9bcb
but, interestingly, the transport is cached ONLY for webhooks configured with a URL, because for those the transport is cacheable for the Client.
Webhooks using a Service use a custom dialer, so the transport is not cached, but (I have to verify it) the client itself will be cached:
kubernetes/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go Lines 148 to 168 in dda9bcb
Regarding the sleep in the webhook server: if we just remove it, the test fails with the TLS handshake errors.
Based on that, I think this problem is related to the caching behavior described above. I've tried different things without success; I think that the divergence between URL and Service webhooks and the multiple caching layers can be problematic in the future (if it is not a problem already :) )
/priority important-soon
We also see these errors a lot in cert-manager webhook logs when deployed on Kubernetes 1.23 or 1.24, when a larger number of resources is being applied that gets validated by the webhook. We also see some 'connection reset' errors that appear to be new in 1.23 and 1.24; not sure if those might be related.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I also hit this problem when deploying my own admission webhook; the Kubernetes version is v1.24.6. Is there any solution or workaround to make my webhook work?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
We are seeing this log in our operator, which has a built-in webhook for conversion/validating/mutating. It doesn't seem to affect the functionality of the webhook, but it keeps appearing in the log seemingly forever. Since the log line carries no other information about its source, we have no idea where to ask for a fix. The Go version is 1.19 and the Kubernetes version is 1.26.3. I'd be glad to provide more information if required. Thanks!
Encountered the same issue in our webhook server. I would like to know if there's any workaround/improvement to make the server more resilient, since simply adding a retry mechanism on the client side does not seem to help.
What happened?
Ran
What did you expect to happen?
tests run without TLS errors
How can we reproduce it (as minimally and precisely as possible)?
Run integration tests
Anything else we need to know?
This happened on go1.17 and go1.18, so it's not new, but it indicates that we either have a setup issue in our integration tests, or our logging outputs errors in situations that should not be errors
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)