Under certain special conditions, a client TLS handshake can block the typha server indefinitely. As you can see here, the `Handshake()` call is performed inside the main server handle `for` loop without any timeout, so a misbehaving client with an unclean TLS handshake can take the server down and block the main loop indefinitely, while other connections sit idle waiting for that handshake to eventually finish.
Expected Behavior
A misbehaving client should not impact the availability of the typha server: the TLS handshake must be performed with a safe upper timeout and eventually moved out of the main server handle `for` loop.
Current Behavior
The client TLS handshake is performed in the main server handle `for` loop without a safe timeout, so misbehaving clients can impact the availability of the typha server if the connection is not cleanly closed/aborted on the client side.
Possible Solution
Implement a timeout on the TLS handshake equal to the duration of one ping interval.
Eventually, move the TLS handshake to an async goroutine to unblock the main server handle `for` loop, so multiple client connections can be handled at the same time.
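A minimal sketch of both suggestions combined: the handshake runs in a per-connection goroutine with a deadline on the underlying connection, so a silent peer gets an `i/o timeout` instead of blocking the accept loop forever. This is not the actual typha code; the timeout value, the empty `tls.Config`, and the `handleConn` helper are assumptions for illustration.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"time"
)

// tlsHandshakeTimeout stands in for one ping interval, as suggested
// above; the exact value here is an assumption for this sketch.
const tlsHandshakeTimeout = 500 * time.Millisecond

// handleConn runs per connection in its own goroutine, so a slow or
// silent handshake cannot stall the accept loop. Hypothetical helper,
// not the real typha handler.
func handleConn(conn net.Conn, done chan<- error) {
	defer conn.Close()
	// Bound the handshake; without a deadline, Handshake() blocks
	// until the peer responds or disconnects.
	conn.SetDeadline(time.Now().Add(tlsHandshakeTimeout))
	tlsConn := tls.Server(conn, &tls.Config{})
	err := tlsConn.Handshake()
	// Clear the deadline so a completed handshake is not affected.
	conn.SetDeadline(time.Time{})
	done <- err
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Misbehaving client: connects and then stays silent, like the
	// telnet reproduction described in this issue.
	go func() {
		c, _ := net.Dial("tcp", ln.Addr().String())
		time.Sleep(2 * time.Second)
		c.Close()
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	done := make(chan error, 1)
	// The accept loop stays free; the handshake happens off-loop.
	go handleConn(conn, done)

	hsErr := <-done
	var nerr net.Error
	if errors.As(hsErr, &nerr) && nerr.Timeout() {
		fmt.Println("handshake timed out as expected")
		return
	}
	panic(fmt.Sprintf("unexpected handshake result: %v", hsErr))
}
```

Clearing the deadline after a successful handshake matters: otherwise a well-behaved client would inherit the short deadline for all subsequent reads and writes.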
Steps to Reproduce (for bugs)
Perform a few cycles of Kubernetes cluster growth and shrink. Eventually, one of the nodes running the `calico-node` client will be abruptly terminated during the client TLS handshake, leaving the connection towards `calico-typha` hanging in an unclean state mid-handshake. `calico-typha` will never recover from that in any reasonable time: the server will stay blocked for days, and eventually all `calico-typha` replicas reach the same state, bringing the node network down.
Another easy way to simulate this is to set up a port-forward to a `calico-typha` pod on port `5473` and then open a `telnet` connection to that port. The typha server will hang with a message like the following, and other clients won't be able to connect to that typha server anymore:
2023-08-07 15:12:57.284 [INFO][7] sync_server.go 421: Accepted from xxx.xxx.xxx.xxx:60824 port=5473
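The telnet reproduction can also be expressed as a small self-contained Go program that plays both sides: a raw TCP client that never sends a ClientHello, and a server-side `Handshake()` call with no deadline, mirroring the blocking call in the accept loop. This is a sketch under assumptions (local listener, empty `tls.Config`), not typha itself:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Stand-in for the telnet client: a raw TCP connection that
	// never sends a ClientHello.
	go func() {
		c, _ := net.Dial("tcp", ln.Addr().String())
		time.Sleep(5 * time.Second)
		c.Close()
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	done := make(chan error, 1)
	go func() {
		// No deadline: this mirrors the server-side handshake
		// call in the typha accept loop.
		done <- tls.Server(conn, &tls.Config{}).Handshake()
	}()

	select {
	case hsErr := <-done:
		panic(fmt.Sprintf("handshake returned unexpectedly: %v", hsErr))
	case <-time.After(1 * time.Second):
		// Still blocked after a full second; in typha this block
		// lasts as long as the silent peer keeps the socket open.
		fmt.Println("Handshake() still blocked after 1s")
	}
}
```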
Context
This issue can eventually bring a Kubernetes production cluster running Calico with calico-typha completely down. Keep in mind the error only reproduces under certain special conditions: a race in which the client is terminated at the exact moment of a TLS handshake with the typha server. Over time, this can affect all replicas of the typha deployment, bringing the availability of the cluster network down.
Your Environment
Calico version - v3.26.1
Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes using the tigera-operator
Operating System and version: AWS EKS using Amazon EKS optimized Amazon Linux AMIs
Link to your project (optional):