Ungraceful node termination leads to 15mins of failing requests #33466
Comments
OK, after further investigation, I found out that this is related to the
We had to set the
As you can see, "TCP Retransmissions" lasted about 3s, which is the value we used for the keepalive timeout. What I don't understand here is that although we had already tried setting TCPKeepalive with Istio before, both mesh-wide and in destination rules, it did not prevent the issue from happening for us. The only way to overcome this problem was to configure the client app. For anyone who ends up here with a similar issue, check the resources below for more info:
Keeping this issue open for now, in case someone from the Istio team can chime in and add more insights.
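For anyone who wants to see what the Istio-side setting looks like, a per-service TCP keepalive in a DestinationRule is roughly the sketch below; the host name and timer values are illustrative placeholders, not the exact ones we used:

```yaml
# Illustrative DestinationRule enabling TCP keepalive probes on upstream
# connections to a service; host and timer values are placeholders.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service-keepalive
spec:
  host: grpc-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 30s     # idle time before the first keepalive probe
          interval: 5s  # interval between probes
          probes: 3     # unanswered probes before the connection is considered dead
```

As the rest of this thread explains, this alone was not enough in our case.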
Unfortunately, TCP keepalive is widely misunderstood. Its primary purpose is NOT to detect half-open connections, despite the fact that you will find a myriad of sources making this claim. Its real purpose is to keep an otherwise idle connection "alive" (hence the name) so that routers and firewalls do not drop it due to inactivity. That these keepalive probes can detect a half-open connection is more of a side effect.

If you are sending application-level data through the TCP socket, that counts as activity, even if it isn't receiving ACKs (and is thus stuck in a TCP retry loop). In this situation, TCP keepalives will NEVER be sent, and the connection will not be considered broken until you reach the OS-defined TCP retry timeout. For this reason, it is essential that you always include some application-level ping to detect half-open connections. It is perfectly fine to do this in conjunction with TCP keepalives, but you must never rely solely on TCP keepalives.
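As a concrete example of such an application-level ping: gRPC clients can enable HTTP/2 keepalive pings through their channel options. The sketch below assumes a Go client and uses placeholder address and timer values; the other gRPC runtimes expose equivalent settings. Because the PING frame is ordinary data on the connection, a dead peer makes the ping time out at the application layer, and the channel is torn down instead of waiting for the kernel's retransmission timer to expire.

```go
// Illustrative Go gRPC client with keepalive (HTTP/2 PING) enabled.
// The target address and timer values are placeholders, not taken from this issue.
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func dialWithAppLevelPings() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"grpc-service.default.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // send a PING after 10s of inactivity
			Timeout:             3 * time.Second,  // close the connection if the PING is not acked within 3s
			PermitWithoutStream: true,             // ping even when no RPCs are in flight
		}),
	)
}
```

Keep in mind that servers can enforce a minimum ping interval and will close connections that ping too aggressively, so the server-side keepalive enforcement policy may need to be relaxed to match.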
Thank you for the explanation @rittneje! The difference between the two wasn't clear to me.
@howardjohn Can you remove the stale label?
not stale
@howardjohn @kebe7jun Please remove the stale label.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-09-16. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager.
There is another option when you don't have the ability to modify the client app: you can force Envoy to open the socket with the TCP_USER_TIMEOUT option, which makes the kernel close connections with unacknowledged data much sooner. There is a good write-up on this here: https://www.evanjones.ca/tcp-connection-timeouts.html
I should also note that retransmission behavior varies from platform to platform. We've experienced similar issues with OpenShift clusters, but EKS does not seem to be affected in the same way; it could be some logic related to tail loss probe. For those of us without control of client behavior, it is possible to work around this like so:
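Concretely, one way to do this in Istio is an EnvoyFilter that adds the socket option to the sidecars' upstream connections. The sketch below is illustrative rather than a tested manifest: level 6 is SOL_TCP and name 18 is TCP_USER_TIMEOUT on Linux, and the 10-second value is just an example. For context, the roughly 15-minute outage window matches the default Linux retransmission budget (net.ipv4.tcp_retries2=15, about 924 seconds), which is exactly what TCP_USER_TIMEOUT cuts short.

```yaml
# Illustrative EnvoyFilter (untested sketch) that sets TCP_USER_TIMEOUT on the
# sidecars' outbound/upstream connections so broken peers fail fast.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: tcp-user-timeout
  namespace: istio-system   # root namespace, so the patch applies mesh-wide
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        upstream_bind_config:
          source_address:
            address: "0.0.0.0"
            port_value: 0
          socket_options:
          - level: 6          # SOL_TCP
            name: 18          # TCP_USER_TIMEOUT (Linux)
            int_value: 10000  # milliseconds of unacknowledged data before the kernel gives up
            state: STATE_PREBIND
```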
I got a lot of meaningful info from this issue, thanks. But I have a small question here: I think what you are talking about is TCP
What I referenced:
Bug description
We are facing the exact same issue as in #28865 but with GKE (see my comment), and I can reproduce it at will.
I have opened a new issue because the istio-policy-bot closed that one.
We use preemptible nodes for our workloads, and whenever a node gets preempted we see a lot of failing requests for around 15 minutes. The issue only seems to affect gRPC services.
For the sake of clarity, this issue does not occur when performing a rolling update of a deployment, only when a node is preempted.
We have tried tweaking the destination rule configuration, and the preStop hook that forces Envoy to drain its listeners as suggested in #7136, but neither seems to help at all.
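For reference, the drain-on-shutdown workaround we tried looks roughly like the pod-spec fragment below. Treat it as a sketch rather than our exact manifest: port 15000 is the sidecar's default Envoy admin port, /drain_listeners?graceful is the Envoy admin drain endpoint, and how you attach the hook to the injected istio-proxy container depends on your injection template.

```yaml
# Illustrative fragment: a preStop hook on the istio-proxy container that asks
# Envoy to drain its listeners before the pod is torn down.
containers:
- name: istio-proxy
  lifecycle:
    preStop:
      exec:
        command:
        - sh
        - -c
        - curl -sf -X POST "http://localhost:15000/drain_listeners?graceful"; sleep 10
```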
Below is our request flow:
When it happens, we see the requests failing at the "HTTP service" level with 408 status codes.
Below are some screenshots from the HTTP service dashboard:
And when checking the metrics for the gRPC service during the same time period, we see that these requests never reach the gRPC service pods. This points to the requests still being routed to the deleted pods, presumably reusing the old TCP connections.
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
[ ] Upgrade
Expected behavior
When a node is preempted, a new TCP connection should be used for subsequent requests and no requests should be routed to the dead pod(s).
Steps to reproduce the bug
I am just sending a request to get the health of the gRPC service here
I use the gcloud command below to simulate a preemption for a specific cluster node:
I have 2 replicas of the gRPC service; below you can see the endpoints:
And only 1 replica of the HTTP service. Below is the output of the istioctl proxy-config command:
Simulating node preemption
Here, I am going to simulate the preemption of one of the nodes running the gRPC service pods.
This time, all of the requests sent by curl are failing. Even after the new endpoint was added to the k8s endpoints:
Although the endpoints are updated after a couple of seconds, that does not prevent the issue from happening and lasting around 15 minutes.
Below is the output for kubectl get ep:
And for the istioctl proxy-config command:
Below are the logs from the istio-proxy container of the HTTP service pod:
As for the new gRPC service pod, it is not receiving any traffic at all. This is the last log line from its istio-proxy container:
I noticed these log messages on the other (surviving) grpc-service istio-proxy container though:
I am not sure whether these are related.
Also, as a side note, we have been seeing inconsistent behaviour in load balancing for inter-service communication from http-service to grpc-service. As I explained above, all of the requests were failing after causing the node preemption, even though one grpc-service pod was still running. We can also confirm this by tailing the logs of both pods: the requests hit only one of the running replicas.
We're using ROUND_ROBIN simple load balancing in the destination rule for the service, so we would expect only around 50% of the requests to fail. However, sometimes no requests fail at all, and sometimes all of them do.
Version
Istio
Kubernetes
Helm
How was Istio installed?
The bug is reproducible in two different environments: in one, Istio was installed using the Helm chart; in the other, using the operator.
Environment where the bug was observed (cloud vendor, OS, etc)
We use GKE
OS: Google's Container-Optimized OS with Docker (cos)
Please let me know if you need any more info.