TLS handshake error from xx: read tcp xx -> xx: read: connection reset by peer #718

Closed
jonathan-innis opened this issue Mar 15, 2023 · 23 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jonathan-innis
Member

Version

Karpenter Version: v0.27.0

Kubernetes Version: v1.24.0

Expected Behavior

Karpenter should not produce these TLS handshake errors generated by a bad connection from the control-plane kube-apiserver to the Karpenter defaulting/validating webhooks.

Actual Behavior

Karpenter generates TLS handshake errors on some calls that the control-plane kube-apiserver routes to Karpenter's validating/defaulting webhooks. These errors appear to be related to kubernetes/kubernetes#109022, which suggests that the handshake errors may be caused by a caching mechanism in the standard library that produces TLS errors on cert rotation.

Steps to Reproduce the Problem

I am unable to reproduce this personally, but users have reported instances of it happening.

Resource Specs and Logs

N/A

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the kind/bug Categorizes issue or PR as related to a bug. label Mar 15, 2023
@cdenneen

cdenneen commented Jun 7, 2023

Confirming the issue. Initial testing suggests the cert is fine.

2023/06/06 17:11:46 http: TLS handshake error from 10.224.18.55:34298: read tcp 10.224.17.42:8443->10.224.18.55:34298: read: connection reset by peer
2023/06/06 18:16:49 http: TLS handshake error from 10.224.18.55:48414: EOF
2023/06/06 21:06:58 http: TLS handshake error from 10.224.18.55:55770: EOF
2023/06/06 21:06:58 http: TLS handshake error from 10.224.18.55:55754: EOF
2023/06/06 21:42:01 http: TLS handshake error from 10.224.18.55:35352: EOF
2023/06/06 21:57:02 http: TLS handshake error from 10.224.18.55:40264: EOF
2023/06/07 02:42:24 http: TLS handshake error from 10.224.18.55:40458: EOF
2023/06/07 02:52:24 http: TLS handshake error from 10.224.18.55:38166: EOF
2023/06/07 06:32:38 http: TLS handshake error from 10.224.18.55:34314: EOF
2023/06/07 10:07:54 http: TLS handshake error from 10.224.18.55:33782: EOF
2023/06/07 11:27:58 http: TLS handshake error from 10.224.18.55:54596: EOF
2023/06/07 11:52:59 http: TLS handshake error from 10.224.18.55:49658: EOF
2023/06/07 12:38:02 http: TLS handshake error from 10.224.18.55:36280: read tcp 10.224.17.42:8443->10.224.18.55:36280: read: connection reset by peer
2023/06/07 12:48:03 http: TLS handshake error from 10.224.18.55:44072: read tcp 10.224.17.42:8443->10.224.18.55:44072: read: connection reset by peer
2023/06/07 12:58:04 http: TLS handshake error from 10.224.18.55:52500: EOF
$ kubectl get MutatingWebhookConfiguration defaulting.webhook.karpenter.k8s.aws -o jsonpath='{.webhooks[].clientConfig.caBundle}' | base64 -d > webhook-ca-cert.pem                           

$ kubectl get secret -n karpenter karpenter-cert -o jsonpath='{.data.server-cert\.pem}' | base64 -d > server-cert.pem

$ openssl verify -verbose -CAfile webhook-ca-cert.pem server-cert.pem
server-cert.pem: OK

$ kubectl port-forward -n karpenter service/karpenter  8443:443
Forwarding from 127.0.0.1:8443 -> 8443
Forwarding from [::1]:8443 -> 8443
Handling connection for 8443

$ curl -vvI https://karpenter.karpenter.svc:8443/default-resource --cacert webhook-ca-cert.pem --resolve karpenter.karpenter.svc:8443:127.0.0.1
* Added karpenter.karpenter.svc:8443:127.0.0.1 to DNS cache
* Hostname karpenter.karpenter.svc was found in DNS cache
*   Trying 127.0.0.1:8443...
* Connected to karpenter.karpenter.svc (127.0.0.1) port 8443 (#0)
* ALPN: offers h2,http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
*  CAfile: webhook-ca-cert.pem
*  CApath: none
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: O=knative.dev; CN=karpenter.karpenter.svc
*  start date: Jun  7 13:27:15 2023 GMT
*  expire date: Jun 14 13:27:15 2023 GMT
*  subjectAltName: host "karpenter.karpenter.svc" matched cert's "karpenter.karpenter.svc"
*  issuer: O=knative.dev; CN=karpenter.karpenter.svc
*  SSL certificate verify ok.
* using HTTP/2
* h2h3 [:method: HEAD]
* h2h3 [:path: /default-resource]
* h2h3 [:scheme: https]
* h2h3 [:authority: karpenter.karpenter.svc:8443]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x2623cb0)
> HEAD /default-resource HTTP/2
> Host: karpenter.karpenter.svc:8443
> user-agent: curl/7.88.1
> accept: */*
> 
< HTTP/2 415 
HTTP/2 415 
< content-type: text/plain; charset=utf-8
content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
x-content-type-options: nosniff
< content-length: 46
content-length: 46
< date: Wed, 07 Jun 2023 13:43:10 GMT
date: Wed, 07 Jun 2023 13:43:10 GMT

< 
* Connection #0 to host karpenter.karpenter.svc left intact
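
For anyone comparing their own logs against the excerpt at the top of this comment, here is a minimal sketch that groups the handshake errors by peer address and by kind (`eof`, `reset`, or `other`). The function name and the output format are illustrative assumptions, not anything Karpenter provides; it only relies on the `http: TLS handshake error from <ip>:<port>: ...` message shape shown in this thread.

```shell
# Summarize "TLS handshake error" log lines read from stdin.
# Prints one line per (peer IP, error kind) pair: "<count> <peer-ip> <kind>".
# Hypothetical helper; works on both the plain-text and JSON log shapes quoted in this thread.
summarize_handshake_errors() {
  awk '
    /TLS handshake error from/ {
      # The peer address is the token after "from", e.g. "10.224.18.55:34298:".
      for (i = 1; i < NF; i++) if ($i == "from") { peer = $(i + 1); break }
      sub(/:[0-9]+:?$/, "", peer)   # drop ":port" and the trailing colon
      kind = "other"
      if ($0 ~ /connection reset by peer/) kind = "reset"
      else if ($0 ~ /: EOF/) kind = "eof"
      count[peer " " kind]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k2,2 -k3,3
}

# Usage (namespace and container name as used elsewhere in this thread):
#   kubectl logs -n karpenter deploy/karpenter -c controller | summarize_handshake_errors
```

If nearly all errors come from one or two peer addresses, that is consistent with the kube-apiserver being the client that drops the connection, as the issue description suggests.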

@sirisaacnuketon

We're seeing this too, we recently upgraded from v0.23.0 to v0.27.5 and are now seeing this in our logs. Is this possibly related to the changes to defaulting.webhook.karpenter.sh in v0.27.3?

@igoratencompass

igoratencompass commented Jun 21, 2023

In my case, certificate verification fails:

$ kubectl get MutatingWebhookConfiguration defaulting.webhook.karpenter.k8s.aws -o jsonpath='{.webhooks[].clientConfig.caBundle}' | base64 -d > webhook-ca-cert.pem
$ kubectl get secret -n karpenter karpenter-cert -o jsonpath='{.data.server-cert\.pem}' | base64 -d > server-cert.pem
$ openssl verify -verbose -CAfile webhook-ca-cert.pem server-cert.pem
O = knative.dev, CN = karpenter.karpenter.svc
error 18 at 0 depth lookup: self-signed certificate
error server-cert.pem: verification failed

This is on EKS 1.25 and karpenter v0.27.5

UPDATE: I think the failed openssl verification happens because the Subject and Issuer have the same DN value in both webhook-ca-cert.pem and server-cert.pem:

$ openssl x509 -noout -in webhook-ca-cert.pem -issuer -subject
issuer=O  = knative.dev, CN = karpenter.karpenter.svc
subject=O = knative.dev, CN = karpenter.karpenter.svc
 
$ openssl x509 -noout -in server-cert.pem -issuer -subject
issuer=O  = knative.dev, CN = karpenter.karpenter.svc
subject=O = knative.dev, CN = karpenter.karpenter.svc

When Subject and Issuer are identical, the cert is expected to be a self-signed CA. That is true for webhook-ca-cert.pem, which has X509v3 Basic Constraints: critical, CA:TRUE, but not for server-cert.pem, which has X509v3 Basic Constraints: critical, CA:FALSE in its X509v3 extensions. Not sure if this is intentional, but that's what I see in the certs installed in my case.
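
The issuer/subject comparison above can be scripted. This is a minimal sketch, assuming the `issuer=` / `subject=` line format printed by `openssl x509 -noout -issuer -subject` as shown in this comment; the function name is hypothetical. It strips all whitespace before comparing (note the `O  =` versus `O =` spacing difference in the output above), so it could in principle conflate DNs that differ only in spaces inside a value.

```shell
# Read `openssl x509 -noout -issuer -subject` output on stdin and report
# whether the issuer and subject DNs match once whitespace is removed.
# Prints "self-issued" or "issued-by-other"; exits with status 2 if either line is missing.
# Hypothetical helper: it compares names only and does not verify any signature.
dn_relationship() {
  awk '
    function norm(s) { gsub(/[ \t]+/, "", s); return s }
    /^issuer=/  { issuer  = norm(substr($0, 8)); have_i = 1 }
    /^subject=/ { subject = norm(substr($0, 9)); have_s = 1 }
    END {
      if (!have_i || !have_s) exit 2
      if (issuer == subject) print "self-issued"
      else print "issued-by-other"
    }
  '
}

# Usage:
#   openssl x509 -noout -in server-cert.pem -issuer -subject | dn_relationship
```

A "self-issued" result for a leaf with CA:FALSE matches the situation described above, where the serving cert shares its DN with the CA that signed it.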

@ellistarn
Contributor

We're seeing this too, we recently upgraded from v0.23.0 to v0.27.5 and are now seeing this in our logs. Is this possibly related to the changes to defaulting.webhook.karpenter.sh in v0.27.3?

Can you confirm you successfully uninstalled the webhook defaulting.webhook.karpenter.sh? Some installers fail to remove these, which can cause the certs to go stale since karpenter is no longer refreshing them.

In my case I see invalid certificate verification:

Are there any relevant logs that suggest the certificate is failing to rotate? defaulting.webhook.karpenter.k8s.aws should be reconciled.

@sirisaacnuketon

Can you confirm you successfully uninstalled the webhook defaulting.webhook.karpenter.sh? Some installers fail to remove these, which can cause the certs to go stale since karpenter is no longer refreshing them.

Yes we've uninstalled defaulting.webhook.karpenter.sh

Are there any relevant logs that suggest the certificate is failing to rotate? defaulting.webhook.karpenter.k8s.aws should be reconciled.

Not that I have found yet, but I can dig in further if necessary. I've checked the certificate and it is valid.

I wonder if it is related to this? kubernetes/kubernetes#109022
We've noted a number of mutating webhooks have this same issue.

@igoratencompass

Are there any relevant logs that suggest the certificate is failing to rotate? defaulting.webhook.karpenter.k8s.aws should be reconciled.

Yes, I can see this in my logs:

$ kubectl logs -n karpenter -l app.kubernetes.io/instance=karpenter,app.kubernetes.io/name=karpenter -c controller | grep 'defaulting\.webhook\.karpenter\.k8s\.aws'
2023-06-21T08:33:41.450Z	ERROR	webhook.DefaultingWebhook	Reconcile error	{"commit": "698f22f-dirty", "knative.dev/traceid": "94a3a61f-6f8d-4f59-b49c-6cbbe1d1e36f", "knative.dev/key": "karpenter/karpenter-cert", "duration": "29.956983ms", "error": "failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"defaulting.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
2023-06-21T08:33:41.455Z	ERROR	webhook.DefaultingWebhook	Reconcile error	{"commit": "698f22f-dirty", "knative.dev/traceid": "545e6624-d50c-473e-aadc-a8b3260b8411", "knative.dev/key": "defaulting.webhook.karpenter.k8s.aws", "duration": "27.15996ms", "error": "failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"defaulting.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}

@igoratencompass

We've noted a number of mutating webhooks have this same issue.

Yep, something is not right here; I've noticed the same. For example, this happened with aws-load-balancer-controller some time ago:

$ kubectl get MutatingWebhookConfiguration aws-load-balancer-webhook -o jsonpath='{.webhooks[].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -issuer -subject -dates
issuer=
subject=
notBefore=Feb 28 18:25:58 2023 GMT
notAfter=May 29 18:25:58 2023 GMT

Notice how the cert has no issuer or subject. Weird things like that.

@stars693

With a fresh install of karpenter v0.29.2, I can see the following errors in the logs:

ERROR	webhook.ValidationWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "cec912be-b21d-4c10-8fc5-f3f8097bb5f9", "knative.dev/key": "karpenter/karpenter-cert", "duration": "38.038178ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
ERROR	webhook.ValidationWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "05ae7945-b3d4-4fca-a76f-079e3a2b8204", "knative.dev/key": "karpenter/karpenter-cert", "duration": "38.388244ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
ERROR	webhook.DefaultingWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "36eb492b-cff8-4395-94c7-ac3ad3b46d92", "knative.dev/key": "karpenter/karpenter-cert", "duration": "54.145729ms", "error": "failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"defaulting.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
ERROR	webhook.ValidationWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "ff69e2a6-2ebb-4f9b-bfe4-0f89e8687559", "knative.dev/key": "validation.webhook.karpenter.sh", "duration": "28.395394ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
ERROR	webhook.ConfigMapWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "fc5cb94f-09f6-45cc-b70f-0fef3c4ef9e9", "knative.dev/key": "validation.webhook.config.karpenter.sh", "duration": "27.197942ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.config.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
ERROR	webhook.ValidationWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "263a2ca2-4fe9-476d-9d1b-7d0d89a9618c", "knative.dev/key": "karpenter/karpenter-cert", "duration": "24.777189ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
DEBUG	controller	http: TLS handshake error from 10.6.3.18:43240: read tcp 10.6.5.80:8443->10.6.3.18:43240: read: connection reset by peer	{"commit": "34d50bf-dirty"}
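
The "the object has been modified" message is the Kubernetes API server's optimistic-concurrency conflict response, which a reconciler normally retries. To see which webhook configurations are hit and how often, a minimal sketch along these lines may help. The function name is hypothetical, and the pattern assumes the exact error wording quoted in the logs above.

```shell
# Count "object has been modified" reconcile conflicts per webhook configuration.
# Reads controller log lines on stdin and prints "<count> <webhook-config-name>".
# Hypothetical helper; the name is taken from between the escaped quotes (\"...\") in the error text.
count_webhook_conflicts() {
  sed -n 's/.*Operation cannot be fulfilled on [^ ]* \\"\([^\\"]*\)\\": the object has been modified.*/\1/p' |
    sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage:
#   kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller | count_webhook_conflicts
```

A handful of conflicts around startup, when more than one replica reconciles the same object, would be unsurprising. A steadily growing count might be worth reporting alongside the handshake errors.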

@yakir-shriker

Karpenter 0.29 with EKS 1.24
{"level":"DEBUG","time":"2023-08-10T16:57:28.441Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:52238: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-10T18:52:51.262Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:42382: read tcp 10.50.109.225:8443->10.50.243.188:42382: read: connection reset by peer","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-10T22:48:35.306Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:34178: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-10T22:53:36.197Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:51612: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-11T07:25:10.074Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:47874: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-11T08:25:20.180Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:41436: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-11T14:46:27.424Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:35908: read tcp 10.50.109.225:8443->10.50.243.188:35908: read: connection reset by peer","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-11T14:56:29.335Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:48772: read tcp 10.50.109.225:8443->10.50.243.188:48772: read: connection reset by peer","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-11T18:37:08.474Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:55546: read tcp 10.50.109.225:8443->10.50.243.188:55546: read: connection reset by peer","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-12T05:24:11.844Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:48684: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-12T13:40:39.578Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:40028: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-12T23:57:30.343Z","logger":"controller","message":"http: TLS handshake error from 10.50.253.27:48050: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-13T01:07:42.479Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:49418: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-13T02:12:53.416Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:60414: EOF","commit":"61cc8f7-dirty"}
{"level":"DEBUG","time":"2023-08-13T06:28:35.428Z","logger":"controller","message":"http: TLS handshake error from 10.50.243.188:56458: EOF","commit":"61cc8f7-dirty"}

@txjjjjj

txjjjjj commented Sep 11, 2023

Any update on this one?

@ellistarn
Contributor

The current plan of record is to deprecate our webhooks and replace them with CEL. Can you confirm that this is just causing log spam, and no other negative side effects?

@txjjjjj

txjjjjj commented Sep 15, 2023

The current plan of record is to deprecate our webhooks and replace them with CEL. Can you confirm that this is just causing log spam, and no other negative side effects?

so far yes

@den-is

den-is commented Sep 26, 2023

I'm observing the same issue.
Running fresh EKS 1.24 and karpenter 0.30

I do not see these errors (let me know if they would show up somewhere other than the karpenter pods):

ERROR	webhook.ValidationWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "cec912be-b21d-4c10-8fc5-f3f8097bb5f9", "knative.dev/key": "karpenter/karpenter-cert", "duration": "38.038178ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.sh\": the object has been modified; please apply your changes to the latest version and try again"}
ERROR	webhook.ValidationWebhook	Reconcile error	{"commit": "34d50bf-dirty", "knative.dev/traceid": "05ae7945-b3d4-4fca-a76f-079e3a2b8204", "knative.dev/key": "karpenter/karpenter-cert", "duration": "38.388244ms", "error": "failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.karpenter.k8s.aws\": the object has been modified; please apply your changes to the latest version and try again"}
...

But every so often I do see the error messages reported by others:

karpenter-798789d575-m8gt2 controller 2023-09-26T02:18:38.201Z	ERROR	webhook	http: TLS handshake error from 10.59.224.175:36316: EOF
karpenter-798789d575-m8gt2 controller 	{"commit": "637a642"}
karpenter-798789d575-tvqqf controller 2023-09-26T05:28:35.710Z	ERROR	webhook	http: TLS handshake error from 10.59.224.175:53282: EOF
2023-09-26T05:28:35.710Z    ERROR    webhook    http: TLS handshake error from 10.59.224.175:53282: EOF

So far Karpenter is functioning just fine, and I have not noticed any critical issues.
I would like to know whether this error is critical or whether I can live with it until a fix arrives.

@ajayOO8

ajayOO8 commented Sep 26, 2023

I am also seeing this issue.

2023-09-26T13:06:11.436Z ERROR webhook http: TLS handshake error from 10.65.33.202:48808: EOF
2023-09-26T13:06:24.896Z ERROR webhook http: TLS handshake error from 10.65.1.231:33578: EOF

We recently migrated from v0.27 to v0.30.
I don't see any performance issues so far, just these annoying error messages, and I'm not sure how critical they are.

@gritzkoo

gritzkoo commented Oct 3, 2023

EKS: 1.24
Karpenter: 0.23.0 -> 0.30.0
We hit the same issue, and the resolution was deleting the certificate as described in aws/karpenter-provider-aws#1398 (comment)

@den-is

den-is commented Oct 3, 2023

EKS: 1.24 Karpenter: 0.23.0 -> 0.30.0 We got the same issue but the resolution was deleting the certificate using this reference #1398 (comment)

Indeed, looks like "fresh reinstall" has helped.
In my case I was doing helm_release using terraform - many different experiments and reinstalls/overwrites, with the same helm release name. So probably karpenter-secret survived or never got updated.

2 hours and no error messages so far.

@ajayOO8

ajayOO8 commented Oct 5, 2023

I also tried deleting the CA cert (aws/karpenter-provider-aws#1398 (comment)) and reinstalling the helm release, but the errors didn't go away for me. I am still seeing them at the same frequency.

@prein

prein commented Oct 19, 2023

Unrelated to the core of the issue: does everyone else also see the http: TLS handshake error from errors logged as plain text while all other logs are JSON (as expected per the zap-logger-config configuration item "encoding": "json")?
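
One quick way to check this on your own cluster is to filter the controller logs for lines that are not JSON objects. A minimal sketch, assuming that every JSON-encoded log line starts with `{`; the function name is hypothetical.

```shell
# Print only the log lines from stdin that do not look like JSON objects,
# judged simply by whether the line starts with "{" after optional whitespace.
# Hypothetical helper; it does not validate the JSON itself.
non_json_lines() {
  grep -v '^[[:space:]]*{'
}

# Usage:
#   kubectl logs -n karpenter deploy/karpenter -c controller | non_json_lines
```

Any `http: TLS handshake error` lines that show up in the output are being written outside the configured JSON encoder.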

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@seanturner026

Is this something that is likely to go away with the move to the new beta in v0.32.x?

@RobCannon

RobCannon commented Nov 3, 2023

I have completely cleaned up karpenter (deleted helm charts, crds and the namespace) and reinstalled. I still see the TLS errors from the webhook. My bigger concern is that the pods fail the health check and restart every few minutes.

I am running 0.32.1

@Hronom

Hronom commented Nov 17, 2023

After upgrading to v0.32.1 this started happening...
A plain reinstall of the helm chart (delete the old karpenter chart, not karpenter-crd, then install the new one) helped.

Don't be afraid: deleting the helm chart does not delete worker nodes =)

@RobCannon

I was using Fargate to run Karpenter, but I gave up on that and created a three-node NodeGroup so that Karpenter has two different nodes to run on plus one extra node for updates. That has gotten rid of all of my Karpenter issues for now.

@jonathan-innis
Member Author

We should be able to close this now that we have released v0.33. Webhooks are fully dropped, so you should stop seeing these TLS handshake errors if you set DISABLE_WEBHOOK=true and drop the mutating webhook configurations and validating webhook configurations from your cluster 🎉

Closing this one for now since the newest version of Karpenter won't run into this problem by default. Please feel free to continue discussion on this one if you are running an older version of Karpenter with the webhooks and need more support.
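
For reference, dropping the webhook configurations could look like the sketch below. The configuration names are the ones that appear in logs earlier in this thread and are an assumption about your cluster, so list what is actually installed first (`kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations`). The function name and the `KUBECTL` override are illustrative.

```shell
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl", to print the commands as a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Delete the Karpenter webhook configurations named earlier in this thread.
# --ignore-not-found keeps the command from failing when a configuration is already gone.
# Hypothetical helper; adjust the names to match what exists in your cluster.
drop_karpenter_webhooks() {
  $KUBECTL delete mutatingwebhookconfiguration --ignore-not-found \
    defaulting.webhook.karpenter.k8s.aws \
    defaulting.webhook.karpenter.sh
  $KUBECTL delete validatingwebhookconfiguration --ignore-not-found \
    validation.webhook.karpenter.sh \
    validation.webhook.karpenter.k8s.aws \
    validation.webhook.config.karpenter.sh
}

# Usage:
#   KUBECTL="echo kubectl" drop_karpenter_webhooks   # dry run, prints the commands
#   drop_karpenter_webhooks                          # actually deletes
```

Running the dry run first makes it easy to confirm the exact objects before anything is removed.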
