calico-typha hangs during unclean client TLS handshake #7908

Closed
rodrigorfk opened this issue Aug 8, 2023 · 2 comments · Fixed by #7909

Comments

@rodrigorfk
Contributor

rodrigorfk commented Aug 8, 2023

Under certain conditions, a client TLS handshake can block the typha server indefinitely. The Handshake() call is performed inside the main server handle for loop without any timeout, so a misbehaving client that leaves its TLS handshake incomplete can block the main loop indefinitely, while other connections sit idle waiting for that handshake to finish.

Expected Behavior

A misbehaving client should not impact the availability of the typha server. The TLS handshake must be performed with a safe upper bound on its duration and eventually be moved out of the main server handle for loop.

Current Behavior

The client TLS handshake is performed in the main server handle for loop without a timeout. Misbehaving clients can impact the availability of the typha server if the connection is not cleanly closed or aborted on the client side.

Possible Solution

  1. Bound the TLS handshake with a timeout equal to the duration of one Ping interval.
  2. Eventually move the TLS handshake to an async goroutine to unblock the main server handle for loop, so that multiple client connections can be handled at the same time.

Steps to Reproduce (for bugs)

  1. Perform a few cycles of Kubernetes cluster growth and shrink. Eventually one of the nodes running the calico-node client is abruptly terminated during the client TLS handshake, leaving its connection to calico-typha hanging in an unclean state mid-handshake. calico-typha does not recover from that in any reasonable time: the server stays blocked for days, and eventually all calico-typha replicas reach the same state, bringing the node network down.
  2. Another easy way to simulate this is to port-forward to a calico-typha pod on port 5473 and then open a telnet connection to that port. The typha server hangs after logging a message like the following, and other clients can no longer connect to that typha server:
2023-08-07 15:12:57.284 [INFO][7] sync_server.go 421: Accepted from xxx.xxx.xxx.xxx:60824 port=5473

Context

This issue can eventually bring down a Kubernetes production cluster running Calico with calico-typha. Keep in mind that the error is only reproduced under certain conditions: a race in which the client is terminated at the exact moment of a TLS handshake with the typha server. Over time, that can affect all replicas of the typha deployment, taking down the cluster network.

Your Environment

  • Calico version - v3.26.1
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes using the tigera-operator
  • Operating System and version: AWS EKS using Amazon EKS optimized Amazon Linux AMIs
  • Link to your project (optional):
@mazdakn
Member

mazdakn commented Aug 22, 2023

@rodrigorfk Thanks for reporting and also providing the fix.

@anthonytwh
Contributor

This issue has been assigned CVE-2023-41378
