Under certain special conditions, a client TLS handshake can block the typha server indefinitely. As you can see here, the `Handshake()` call is performed inside the main server handle `for` loop without any timeout, so a misbehaving client with an unclean TLS handshake can take the server down and block the main loop indefinitely, while other connections sit idle waiting for that handshake to eventually finish.
Expected Behavior
A misbehaving client should not impact the availability of the typha server: the TLS handshake must be performed with a safe upper timeout and eventually moved out of the main server handle `for` loop.
Current Behavior
The client TLS handshake is performed in the main server handle `for` loop without a safe timeout, so misbehaving clients can impact the availability of the typha server if the connection is not cleanly closed/aborted on the client side.
Possible Solution
Implement a timeout on the TLS handshake equal to the duration of one ping interval.
Eventually, move the TLS handshake to an async goroutine to unblock the main server handle `for` loop, so multiple client connections can be handled at the same time.
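A minimal sketch of both suggestions combined: the handshake runs in a per-connection goroutine with a deadline on the underlying connection, so a silent peer gets an `i/o timeout` instead of blocking the accept loop forever. This is not the actual typha code; the timeout value, the empty `tls.Config`, and the `handleConn` helper are assumptions for illustration.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"time"
)

// tlsHandshakeTimeout stands in for one ping interval, as suggested
// above; the exact value here is an assumption for this sketch.
const tlsHandshakeTimeout = 500 * time.Millisecond

// handleConn runs per connection in its own goroutine, so a slow or
// silent handshake cannot stall the accept loop. Hypothetical helper,
// not the real typha handler.
func handleConn(conn net.Conn, done chan<- error) {
	defer conn.Close()
	// Bound the handshake; without a deadline, Handshake() blocks
	// until the peer responds or disconnects.
	conn.SetDeadline(time.Now().Add(tlsHandshakeTimeout))
	tlsConn := tls.Server(conn, &tls.Config{})
	err := tlsConn.Handshake()
	// Clear the deadline so a completed handshake is not affected.
	conn.SetDeadline(time.Time{})
	done <- err
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Misbehaving client: connects and then stays silent, like the
	// telnet reproduction described in this issue.
	go func() {
		c, _ := net.Dial("tcp", ln.Addr().String())
		time.Sleep(2 * time.Second)
		c.Close()
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	done := make(chan error, 1)
	// The accept loop stays free; the handshake happens off-loop.
	go handleConn(conn, done)

	hsErr := <-done
	var nerr net.Error
	if errors.As(hsErr, &nerr) && nerr.Timeout() {
		fmt.Println("handshake timed out as expected")
		return
	}
	panic(fmt.Sprintf("unexpected handshake result: %v", hsErr))
}
```

Clearing the deadline after a successful handshake matters: otherwise a well-behaved client would inherit the short deadline for all subsequent reads and writes.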
Steps to Reproduce (for bugs)
Perform a few cycles of Kubernetes cluster growth and shrink. Eventually, one of the nodes running the `calico-node` client will be abruptly terminated during the client TLS handshake, leaving the connection towards `calico-typha` hanging in an unclean state mid-handshake. `calico-typha` will never recover from that in any reasonable time: the server will stay blocked for days, and eventually all `calico-typha` replicas reach the same state, bringing the node network down.
Another easy way to simulate this is to set up a port-forward to a `calico-typha` pod on port `5473` and then open a `telnet` connection to that port. The typha server will hang with a message like the following, and other clients won't be able to connect to that typha server anymore:
2023-08-07 15:12:57.284 [INFO][7] sync_server.go 421: Accepted from xxx.xxx.xxx.xxx:60824 port=5473
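The telnet reproduction can also be expressed as a small self-contained Go program that plays both sides: a raw TCP client that never sends a ClientHello, and a server-side `Handshake()` call with no deadline, mirroring the blocking call in the accept loop. This is a sketch under assumptions (local listener, empty `tls.Config`), not typha itself:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Stand-in for the telnet client: a raw TCP connection that
	// never sends a ClientHello.
	go func() {
		c, _ := net.Dial("tcp", ln.Addr().String())
		time.Sleep(5 * time.Second)
		c.Close()
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	done := make(chan error, 1)
	go func() {
		// No deadline: this mirrors the server-side handshake
		// call in the typha accept loop.
		done <- tls.Server(conn, &tls.Config{}).Handshake()
	}()

	select {
	case hsErr := <-done:
		panic(fmt.Sprintf("handshake returned unexpectedly: %v", hsErr))
	case <-time.After(1 * time.Second):
		// Still blocked after a full second; in typha this block
		// lasts as long as the silent peer keeps the socket open.
		fmt.Println("Handshake() still blocked after 1s")
	}
}
```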
Context
This issue can eventually bring a Kubernetes production cluster running Calico with calico-typha completely down. Keep in mind the error only reproduces under certain special conditions: a race in which the client is terminated at the exact moment of a TLS handshake with the typha server. Over time, this can affect all replicas of the typha deployment, bringing the availability of the cluster network down.
Your Environment
Calico version - v3.26.1
Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes using the tigera-operator
Operating System and version: AWS EKS using Amazon EKS optimized Amazon Linux AMIs
Link to your project (optional):