integration tests log many "http: TLS handshake error from 127.0.0.1:55336: EOF" errors #109022

liggitt · 2022-03-25T15:15:27Z

What happened?

Ran

go test ./test/integration/apiserver/admissionwebhook -v -run TestWebhookAdmissionWithWatchCache

What did you expect to happen?

tests run without TLS errors

How can we reproduce it (as minimally and precisely as possible)?

Run integration tests

Anything else we need to know?

This happens on both go1.17 and go1.18, so it isn't new, but it indicates that either our integration tests have a setup issue or our logging reports errors in situations that should not error.

k8s-ci-robot · 2022-03-25T15:15:33Z

@liggitt: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

/sig <group-name>
/wg <group-name>
/committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-03-25T15:15:35Z

@liggitt: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

aojea · 2022-03-25T15:25:43Z

/cc

aojea · 2022-03-25T15:41:41Z

This seems to be coming from these poll loops:

kubernetes/test/integration/apiserver/admissionwebhook/admission_test.go

Lines 681 to 682 in b7c2faf

    err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
    	obj, err := c.client.Resource(c.gvr).Namespace(obj.GetNamespace()).Get(context.TODO(), obj.GetName(), metav1.GetOptions{})

kubernetes/test/integration/apiserver/admissionwebhook/admission_test.go

Line 726 in b7c2faf

    err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {

The client retries the EOF internally.

liggitt · 2022-03-25T16:11:26Z

that's... weird... I don't see any reason those would hit TLS EOF errors

aojea · 2022-03-25T16:19:19Z

diff --git a/test/integration/apiserver/admissionwebhook/admission_test.go b/test/integration/apiserver/admissionwebhook/admission_test.go
index 4c64bdca26f..031e49326c8 100644
--- a/test/integration/apiserver/admissionwebhook/admission_test.go
+++ b/test/integration/apiserver/admissionwebhook/admission_test.go
@@ -275,7 +275,7 @@ func (h *holder) record(version string, phase string, converted bool, request *a
        defer h.lock.Unlock()
 
        // this is useful to turn on if items aren't getting recorded and you need to figure out why
-       debug := false
+       debug := true
        if debug {
                h.t.Logf("%s %#v %v", request.Operation, request.Resource, request.SubResource)
        }
@@ -733,6 +733,7 @@ func testResourceDelete(c *testContext) {
 
        // wait for the item to be gone
        err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
+               fmt.Println("DEBUG testResourceDelete")
                obj, err := c.client.Resource(c.gvr).Namespace(obj.GetNamespace()).Get(context.TODO(), obj.GetName(), metav1.GetOptions{})
                if apierrors.IsNotFound(err) {
                        return true, nil
@@ -747,6 +748,7 @@ func testResourceDelete(c *testContext) {
                c.t.Error(err)
                return
        }
+       fmt.Println("DEBUG testResourceDelete FINISH")

    admission_test.go:323: recording: admissionwebhook.webhookOptions{version:"v1beta1", phase:"validation", converted:false} = DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
    admission_test.go:280: DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
    admission_test.go:323: recording: admissionwebhook.webhookOptions{version:"v1", phase:"validation", converted:true} = DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
DEBUG testResourceDelete
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56638: EOF
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56634: EOF
DEBUG testResourceDelete FINISH
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56636: EOF
    admission_test.go:280: CREATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"}

aojea · 2022-03-25T17:37:31Z

I0325 18:35:16.399766  444569 pathrecorder.go:248] apiextensions-apiserver: "/apis/random.numbers.com/v1/integers/fortytwo" satisfied by prefix /apis/
I0325 18:35:16.400329  444569 round_trippers.go:463] POST https://127.0.0.1:46149/v1beta1/mutation?timeout=30s
I0325 18:35:16.400337  444569 round_trippers.go:469] Request Headers:
I0325 18:35:16.400345  444569 round_trippers.go:473]     User-Agent: kube-apiserver-admission
I0325 18:35:16.400352  444569 round_trippers.go:473]     Accept: application/json, */*
I0325 18:35:16.400359  444569 round_trippers.go:473]     Content-Type: application/json
    admission_test.go:281: UPDATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
2022/03/25 18:35:16 http: TLS handshake error from 127.0.0.1:56520: EOF
    admission_test.go:297: {random.numbers.com v1 integers} != {  }
2022/03/25 18:35:16 http: TLS handshake error from 127.0.0.1:56522: EOF
I0325 18:35:16.400604  444569 round_trippers.go:574] Response Status: 200 OK in 0 milliseconds
2022/03/25 18:35:16 http: TLS handshake error from 127.0.0.1:56524: EOF
I0325 18:35:16.400950  444569 round_trippers.go:463] POST https://127.0.0.1:46149/v1beta1/convert/mutation?timeout=30s
I0325 18:35:16.400959  444569 round_trippers.go:469] Request Headers:
I0325 18:35:16.400972  444569 round_trippers.go:473]     Accept: application/json, */*
I0325 18:35:16.400987  444569 round_trippers.go:473]     Content-Type: application/json
I0325 18:35:16.401000  444569 round_trippers.go:473]     User-Agent: kube-apiserver-admission
    admission_test.go:281: UPDATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
    admission_test.go:297: {random.numbers.com v1 integers} != {  }
I0325 18:35:16.401255  444569 round_trippers.go:574] Response Status: 200 OK in 0 milliseconds
I0325 18:35:16.401598  444569 round_trippers.go:463] POST https://127.0.0.1:46149/v1/mutation?timeout=10s
I0325 18:35:16.401607  444569 round_trippers.go:469] Request Headers:
I0325 18:35:16.401615  444569 round_trippers.go:473]     Accept: application/json, */*
I0325 18:35:16.401623  444569 round_trippers.go:473]     User-Agent: kube-apiserver-admission
I0325 18:35:16.401631  444569 round_trippers.go:473]     Content-Type: application/json

aojea · 2022-03-25T17:41:46Z

who is logging those lines with the EOF error?

liggitt · 2022-03-25T19:16:05Z

who is logging those lines with the EOF error?

http.Server

aojea · 2022-03-28T10:55:46Z

I've found the cause, but there are several things I'd like to sort out first

aojea · 2022-03-30T16:30:19Z

Webhooks have a very interesting and complex setup, so let me write it down for reference.

The apiserver uses a RESTClient to contact each webhook; this is handled by the ClientManager:

kubernetes/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go

Lines 65 to 66 in dda9bcb

    func NewClientManager(gvs []schema.GroupVersion, addToSchemaFuncs ...func(s *runtime.Scheme) error) (ClientManager, error) {
    	cache, err := lru.New(defaultCacheSize)

Conversion webhooks and validating/mutating webhooks don't use the same client manager, though:

$ grep -r NewClientManager staging/
staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/webhook_converter.go:       clientManager, err := webhook.NewClientManager(
staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/generic/webhook.go:   cm, err := webhookutil.NewClientManager(
staging/src/k8s.io/apiserver/pkg/util/webhook/client.go:// NewClientManager creates a clientManager.
staging/src/k8s.io/apiserver/pkg/util/webhook/client.go:func NewClientManager(gvs []schema.GroupVersion, addToSchemaFuncs ...func(s *runtime.Scheme) error) (ClientMa

The ClientManager caches the clients. However, the cache key takes the URL path into account, so clients for the same host are not shared when their paths differ; e.g. webhook.url/validatingv1 and webhook.url/validatingv1beta1 each get a different client:

kubernetes/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go

Lines 122 to 124 in dda9bcb

    ccWithNoName := cc
    ccWithNoName.Name = ""
    cacheKey, err := json.Marshal(ccWithNoName)

Interestingly, though, the transport is cached ONLY for webhooks that use a URL, because only their client config yields a cacheable transport:

kubernetes/staging/src/k8s.io/client-go/transport/cache.go

Line 136 in 0765ba8

if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {

Webhooks that use a Service set a custom dialer, so their transport is not cached, but (I still have to verify this) the client itself will be cached by the ClientManager cache:

kubernetes/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go

Lines 148 to 168 in dda9bcb

    if cc.Service != nil {
    	restConfig, err := cm.authInfoResolver.ClientConfigForService(cc.Service.Name, cc.Service.Namespace)
    	if err != nil {
    		return nil, err
    	}
    	cfg := rest.CopyConfig(restConfig)
    	serverName := cc.Service.Name + "." + cc.Service.Namespace + ".svc"
    	host := serverName + ":443"
    	cfg.Host = "https://" + host
    	cfg.APIPath = cc.Service.Path
    	// Set the server name if not already set
    	if len(cfg.TLSClientConfig.ServerName) == 0 {
    		cfg.TLSClientConfig.ServerName = serverName
    	}
    	delegateDialer := cfg.Dial
    	if delegateDialer == nil {
    		var d net.Dialer
    		delegateDialer = d.DialContext
    	}
    	cfg.Dial = func(ctx context.Context, network, addr string) (net.Conn, error) {

Regarding the "http: TLS handshake error from 127.0.0.1:56638: EOF" errors, I was surprised they weren't happening in this test too:

kubernetes/test/integration/apiserver/admissionwebhook/load_balance_test.go

Line 51 in 0765ba8

func TestWebhookLoadBalance(t *testing.T) {

but if we just remove the sleep on the webhook server

diff --git a/test/integration/apiserver/admissionwebhook/load_balance_test.go b/test/integration/apiserver/admissionwebhook/load_balance_test.go
index 0ed6c3911f3..42369d5c780 100644
--- a/test/integration/apiserver/admissionwebhook/load_balance_test.go
+++ b/test/integration/apiserver/admissionwebhook/load_balance_test.go
@@ -276,7 +276,7 @@ func newLoadBalanceWebhookHandler(recorder *connectionRecorder) http.Handler {
                }
 
                // simulate a loaded backend
-               time.Sleep(2 * time.Second)
+               //time.Sleep(2 * time.Second)
                allow(w)
        })
 }

the test fails with the TLS handshake errors

2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57444: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57440: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57442: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57446: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57452: EOF
    load_balance_test.go:208: expected no additional connections (reusing kept-alive connections), got 9
E0330 18:27:45.585947  426498 context.go:98] "Attempted to add audit annotations from unsupported request chain" annotations=[authorization.k8s.io/decision allow authorization.k8s.io/reason ]
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57448: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57450: EOF

based on that I think that this problem is related to

golang/go#50984

I've tried different things without success. I think the divergence between URL and Service webhooks, together with the multiple caching layers, can become problematic in the future (if it is not a problem already :) )

aojea · 2022-03-30T17:00:34Z

/priority important-soon

irbekrm · 2022-06-06T08:45:10Z

We also see these errors a lot in cert-manager webhook logs when deployed on Kubernetes 1.23 or 1.24 when a larger number of resources are being applied that get validated by webhook.

We also see some 'connection reset' errors that also appear to be new in 1.23 and 1.24, not sure if those might be related:

...
I0606 08:43:29.003459       1 logs.go:59] http: TLS handshake error from 10.244.0.1:23178: EOF
I0606 08:43:29.013764       1 logs.go:59] http: TLS handshake error from 10.244.0.1:8326: read tcp 10.244.0.5:10250->10.244.0.1:8326: read: connection reset by peer
I0606 08:43:29.014414       1 logs.go:59] http: TLS handshake error from 10.244.0.1:14238: EOF
...

k8s-triage-robot · 2022-09-04T09:38:29Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ritazh · 2022-09-09T00:15:19Z

/remove-lifecycle stale

cyniczhi · 2022-11-16T02:33:17Z

I also hit this problem when deploying my own admission webhook on Kubernetes v1.24.6. Is there any solution or workaround to make my webhook work?

k8s-triage-robot · 2023-02-14T02:45:47Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sozercan · 2023-03-06T20:42:26Z

/remove-lifecycle stale

enj · 2023-03-20T18:19:12Z

/lifecycle frozen

gaohoward · 2023-06-14T13:55:02Z

We are seeing this log in our operator, which has a built-in webhook for conversion, validation, and mutation. It doesn't seem to affect the webhook's functionality, but it keeps appearing in the log indefinitely. Because the log line gives no other information about its source, we have no idea where to ask for a fix. The Go version is 1.19 and the Kubernetes version is 1.26.3. I'd be glad to provide more information if required. Thanks!

reborn1867 · 2024-03-22T03:20:04Z

We encountered the same issue in our webhook server. Is there any workaround or improvement that would make the server more resilient? Simply adding a retry mechanism on the client side does not seem to work.

liggitt added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 25, 2022

k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2022

liggitt added area/test and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2022

liggitt added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 25, 2022

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 30, 2022

aojea mentioned this issue Mar 31, 2022

Integration Webhook tests: TLS handshakes EOF errors #109193

Closed

irbekrm mentioned this issue Jun 1, 2022

Slow healthz and livez endpoints cause liveness and readiness probe failures cert-manager/cert-manager#5137

Closed

irbekrm mentioned this issue Jun 6, 2022

TLS handshake error: EOF cert-manager/cert-manager#4594

Closed

ritazh mentioned this issue Jul 1, 2022

TLS handshake error from: EOF open-policy-agent/gatekeeper#2142

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2022

mochizuki875 mentioned this issue Jan 5, 2023

read: connection reset by peer kubernetes-sigs/hierarchical-namespaces#236

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2023

jonathan-innis mentioned this issue Nov 2, 2023

TLS handshake error from xx: read tcp xx -> xx: read: connection reset by peer kubernetes-sigs/karpenter#718

Closed

k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 20, 2023

ikyrycho mentioned this issue Apr 17, 2023

EOF error in cdi-apiserver kubevirt/containerized-data-importer#2695

Closed

aojea mentioned this issue Apr 29, 2023

Make connection lost retryable in webhook #117251

Merged

matthchr mentioned this issue Sep 25, 2023

Unable to get resourcegroup, it is created but cannot be seen with kubectl get Azure/azure-service-operator#3329

Closed

erikgb mentioned this issue Oct 26, 2023

Annoying "http: TLS handshake error" in logs cybozu-go/accurate#104

Open

4 tasks

This was referenced Jan 10, 2024

Migrate to XP ManagementPolicies crossplane-contrib/provider-kubernetes#163

Merged

TLS error logs crossplane-contrib/provider-kubernetes#175

Open

moolen mentioned this issue Jan 18, 2024

Webhook Pod outputs TLS handshake errors external-secrets/external-secrets#2983

Closed

Julian-Chu mentioned this issue Apr 25, 2024

VPA admission controller: TLS handshake error kubernetes/autoscaler#6770

Open

shreyas-s-rao mentioned this issue May 9, 2024

Druid Refactor to Address Multiple Controller Conflicts gardener/etcd-druid#777

Open
