Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration tests log many "http: TLS handshake error from 127.0.0.1:55336: EOF" errors #109022

Open
liggitt opened this issue Mar 25, 2022 · 21 comments
Labels
area/test kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@liggitt
Copy link
Member

liggitt commented Mar 25, 2022

What happened?

Ran

go test ./test/integration/apiserver/admissionwebhook -v -run TestWebhookAdmissionWithWatchCache

What did you expect to happen?

tests run without TLS errors

How can we reproduce it (as minimally and precisely as possible)?

Run integration tests

Anything else we need to know?

This happened on go1.17 and go1.18, so it's not new, but indicates we either have a setup issue in our integration tests, or our logging is outputting errors in situations that should not error

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@liggitt liggitt added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 25, 2022
@k8s-ci-robot
Copy link
Contributor

@liggitt: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2022
@liggitt liggitt added area/test and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2022
@k8s-ci-robot
Copy link
Contributor

@liggitt: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@liggitt liggitt added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 25, 2022
@aojea
Copy link
Member

aojea commented Mar 25, 2022

/cc

@aojea
Copy link
Member

aojea commented Mar 25, 2022

seems to be coming from this poll loops

err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
obj, err := c.client.Resource(c.gvr).Namespace(obj.GetNamespace()).Get(context.TODO(), obj.GetName(), metav1.GetOptions{})

err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {

err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
obj, err := c.client.Resource(c.gvr).Namespace(obj.GetNamespace()).Get(context.TODO(), obj.GetName(), metav1.GetOptions{})

the client retries internally the EOF

@liggitt
Copy link
Member Author

liggitt commented Mar 25, 2022

that's... weird... I don't see any reason those would hit TLS EOF errors

@aojea
Copy link
Member

aojea commented Mar 25, 2022

diff --git a/test/integration/apiserver/admissionwebhook/admission_test.go b/test/integration/apiserver/admissionwebhook/admission_test.go
index 4c64bdca26f..031e49326c8 100644
--- a/test/integration/apiserver/admissionwebhook/admission_test.go
+++ b/test/integration/apiserver/admissionwebhook/admission_test.go
@@ -275,7 +275,7 @@ func (h *holder) record(version string, phase string, converted bool, request *a
        defer h.lock.Unlock()
 
        // this is useful to turn on if items aren't getting recorded and you need to figure out why
-       debug := false
+       debug := true
        if debug {
                h.t.Logf("%s %#v %v", request.Operation, request.Resource, request.SubResource)
        }
@@ -733,6 +733,7 @@ func testResourceDelete(c *testContext) {
 
        // wait for the item to be gone
        err = wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
+               fmt.Println("DEBUG testResourceDelete")
                obj, err := c.client.Resource(c.gvr).Namespace(obj.GetNamespace()).Get(context.TODO(), obj.GetName(), metav1.GetOptions{})
                if apierrors.IsNotFound(err) {
                        return true, nil
@@ -747,6 +748,7 @@ func testResourceDelete(c *testContext) {
                c.t.Error(err)
                return
        }
+       fmt.Println("DEBUG testResourceDelete FINISH")
    admission_test.go:323: recording: admissionwebhook.webhookOptions{version:"v1beta1", phase:"validation", converted:false} = DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
    admission_test.go:280: DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
    admission_test.go:323: recording: admissionwebhook.webhookOptions{version:"v1", phase:"validation", converted:true} = DELETE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
DEBUG testResourceDelete
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56638: EOF
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56634: EOF
DEBUG testResourceDelete FINISH
2022/03/25 16:39:36 http: TLS handshake error from 127.0.0.1:56636: EOF
    admission_test.go:280: CREATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 

@aojea
Copy link
Member

aojea commented Mar 25, 2022

I0325 18:35:16.399766  444569 pathrecorder.go:248] apiextensions-apiserver: "/apis/random.numbers.com/v1/integers/fortytwo" satisfied by prefix /apis/
I0325 18:35:16.400329  444569 round_trippers.go:463] POST https://127.0.0.1:46149/v1beta1/mutation?timeout=30s
I0325 18:35:16.400337  444569 round_trippers.go:469] Request Headers:
I0325 18:35:16.400345  444569 round_trippers.go:473]     User-Agent: kube-apiserver-admission
I0325 18:35:16.400352  444569 round_trippers.go:473]     Accept: application/json, */*
I0325 18:35:16.400359  444569 round_trippers.go:473]     Content-Type: application/json
    admission_test.go:281: UPDATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
2022/03/25 18:35:16 http: TLS handshake error from 127.0.0.1:56520: EOF
    admission_test.go:297: {random.numbers.com v1 integers} != {  }
2022/03/25 18:35:16 http: TLS handshake error from 127.0.0.1:56522: EOF
I0325 18:35:16.400604  444569 round_trippers.go:574] Response Status: 200 OK in 0 milliseconds
2022/03/25 18:35:16 http: TLS handshake error from 127.0.0.1:56524: EOF
I0325 18:35:16.400950  444569 round_trippers.go:463] POST https://127.0.0.1:46149/v1beta1/convert/mutation?timeout=30s
I0325 18:35:16.400959  444569 round_trippers.go:469] Request Headers:
I0325 18:35:16.400972  444569 round_trippers.go:473]     Accept: application/json, */*
I0325 18:35:16.400987  444569 round_trippers.go:473]     Content-Type: application/json
I0325 18:35:16.401000  444569 round_trippers.go:473]     User-Agent: kube-apiserver-admission
    admission_test.go:281: UPDATE v1.GroupVersionResource{Group:"random.numbers.com", Version:"v1", Resource:"integers"} 
    admission_test.go:297: {random.numbers.com v1 integers} != {  }
I0325 18:35:16.401255  444569 round_trippers.go:574] Response Status: 200 OK in 0 milliseconds
I0325 18:35:16.401598  444569 round_trippers.go:463] POST https://127.0.0.1:46149/v1/mutation?timeout=10s
I0325 18:35:16.401607  444569 round_trippers.go:469] Request Headers:
I0325 18:35:16.401615  444569 round_trippers.go:473]     Accept: application/json, */*
I0325 18:35:16.401623  444569 round_trippers.go:473]     User-Agent: kube-apiserver-admission
I0325 18:35:16.401631  444569 round_trippers.go:473]     Content-Type: application/json

@aojea
Copy link
Member

aojea commented Mar 25, 2022

who is logging those lines with the EOF error?

@liggitt
Copy link
Member Author

liggitt commented Mar 25, 2022

who is logging those lines with the EOF error?

http.Server

@aojea
Copy link
Member

aojea commented Mar 28, 2022

I've found the cause, but there are several things I'd like to sort out first

@aojea
Copy link
Member

aojea commented Mar 30, 2022

Webhooks has a very interesting and complex setup, let me write it down for reference:

Webhooks inside the apiserver use a RESTClient to contact the webhooks, this is handled by the ClientManager

func NewClientManager(gvs []schema.GroupVersion, addToSchemaFuncs ...func(s *runtime.Scheme) error) (ClientManager, error) {
cache, err := lru.New(defaultCacheSize)

conversion and validationg/mutation webhooks doesn't use the same client manager though

$ grep -r NewClientManager staging/
staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/webhook_converter.go:       clientManager, err := webhook.NewClientManager(
staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/generic/webhook.go:   cm, err := webhookutil.NewClientManager(
staging/src/k8s.io/apiserver/pkg/util/webhook/client.go:// NewClientManager creates a clientManager.
staging/src/k8s.io/apiserver/pkg/util/webhook/client.go:func NewClientManager(gvs []schema.GroupVersion, addToSchemaFuncs ...func(s *runtime.Scheme) error) (ClientMa

The ClientManager caches the clients, however, the key takes the URL path into account, that means that same host will not be cached if they have different path, i.e. webhook.url/validatingv1 and webhook.url/validatingv1beta1 will have a different client

ccWithNoName := cc
ccWithNoName.Name = ""
cacheKey, err := json.Marshal(ccWithNoName)

but, interestingly, the transport is cached ONLY for webhooks using URL, because the transport is cacheable for the Client

if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {

webhooks using a Service use a custom dialer, so the transport is not cached, but (I have to verify it) the client will be cached by the ClientManager cache

if cc.Service != nil {
restConfig, err := cm.authInfoResolver.ClientConfigForService(cc.Service.Name, cc.Service.Namespace)
if err != nil {
return nil, err
}
cfg := rest.CopyConfig(restConfig)
serverName := cc.Service.Name + "." + cc.Service.Namespace + ".svc"
host := serverName + ":443"
cfg.Host = "https://" + host
cfg.APIPath = cc.Service.Path
// Set the server name if not already set
if len(cfg.TLSClientConfig.ServerName) == 0 {
cfg.TLSClientConfig.ServerName = serverName
}
delegateDialer := cfg.Dial
if delegateDialer == nil {
var d net.Dialer
delegateDialer = d.DialContext
}
cfg.Dial = func(ctx context.Context, network, addr string) (net.Conn, error) {

Regarding the http: TLS handshake error from 127.0.0.1:56638: EOF errors, I was surprised it wasn't happening on this test too

func TestWebhookLoadBalance(t *testing.T) {

but if we just remove the sleep on the webhook server

diff --git a/test/integration/apiserver/admissionwebhook/load_balance_test.go b/test/integration/apiserver/admissionwebhook/load_balance_test.go
index 0ed6c3911f3..42369d5c780 100644
--- a/test/integration/apiserver/admissionwebhook/load_balance_test.go
+++ b/test/integration/apiserver/admissionwebhook/load_balance_test.go
@@ -276,7 +276,7 @@ func newLoadBalanceWebhookHandler(recorder *connectionRecorder) http.Handler {
                }
 
                // simulate a loaded backend
-               time.Sleep(2 * time.Second)
+               //time.Sleep(2 * time.Second)
                allow(w)
        })
 }

the test fails with the TLS handshake errors

2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57444: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57440: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57442: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57446: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57452: EOF
    load_balance_test.go:208: expected no additional connections (reusing kept-alive connections), got 9
E0330 18:27:45.585947  426498 context.go:98] "Attempted to add audit annotations from unsupported request chain" annotations=[authorization.k8s.io/decision allow authorization.k8s.io/reason ]
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57448: EOF
2022/03/30 18:27:45 http: TLS handshake error from 127.0.0.1:57450: EOF

based on that I think that this problem is related to

golang/go#50984

I've tried different things without success, I think that the divergence between URL and Service webhooks and the multiple caching layers can be problematic in a future (if it is not a problem already :) )

@aojea
Copy link
Member

aojea commented Mar 30, 2022

/priority important-soon

@irbekrm
Copy link

irbekrm commented Jun 6, 2022

We also see these errors a lot in cert-manager webhook logs when deployed on Kubernetes 1.23 or 1.24 when a larger number of resources are being applied that get validated by webhook.

We also see some 'connection reset' errors that also appear to be new in 1.23 and 1.24, not sure if those might be related:

...
I0606 08:43:29.003459       1 logs.go:59] http: TLS handshake error from 10.244.0.1:23178: EOF
I0606 08:43:29.013764       1 logs.go:59] http: TLS handshake error from 10.244.0.1:8326: read tcp 10.244.0.5:10250->10.244.0.1:8326: read: connection reset by peer
I0606 08:43:29.014414       1 logs.go:59] http: TLS handshake error from 10.244.0.1:14238: EOF
...

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2022
@ritazh
Copy link
Member

ritazh commented Sep 9, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2022
@cyniczhi
Copy link

I also met this problem when deploy my own admission webhook, the kubernetes version is v1.24.6. Is there any solution or workaround to make my webhook workable?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2023
@sozercan
Copy link

sozercan commented Mar 6, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2023
@enj
Copy link
Member

enj commented Mar 20, 2023

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 20, 2023
@gaohoward
Copy link

we are seeing this log in our operator which has a built-in webhook for conversion/validating/mutating. It seems doesn't affect the functionalities of the webhook but it keeps appearing in log like forever. As the log shows no other info as to the source of the log we have no idea where to ask for a fix. The go version is 1.19 and kubernetes version is 1.26.3. I'd be glad to provide more information if required. Thanks!

@reborn1867
Copy link

Encountered the same issue in our webhook server, I would like to know if there's any workaround/improvement to make the server more resilient, since it seems not working to simply add retry mechanism on client side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Projects
None yet
Development

No branches or pull requests