
Ungraceful node termination leads to 15mins of failing requests #33466

Closed · mehyedes opened this issue Jun 16, 2021 · 10 comments

Labels: area/networking, lifecycle/automatically-closed, lifecycle/stale

mehyedes commented Jun 16, 2021

Bug description
We are facing the exact same issue as in #28865, but with GKE (see my comment), and I can reproduce it at will.

I have opened a new issue because the istio-policy-bot closed that one.

We use preemptible nodes for our workloads, and whenever a node gets preempted, we see a lot of failing requests for around 15 minutes. The issue seems to affect only gRPC services.

For the sake of clarity, this issue does not occur when performing a rolling update of a deployment, but only when a node is preempted.

We have tried tweaking the destination rule configuration, as well as adding a preStop hook to force Envoy to drain its listeners as suggested in #7136, but neither seems to help.
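For context, the preStop hook we experimented with was along these lines. This is only a sketch: whether it is attached to the application container or to the injected istio-proxy sidecar depends on how the suggestion in #7136 is applied, and the Envoy admin API is assumed to be listening on localhost:15000 (the Istio default).

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      # ask Envoy to drain inbound listeners gracefully, then give in-flight requests time to finish
      - "curl -sf -X POST 'http://localhost:15000/drain_listeners?inboundonly&graceful'; sleep 15"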

Below is our request flow:

         +----------------+          +------------------+      +-----------------+
         |                |          |                  |      |                 |
         |                |          |                  |      |                 |
---------> ingress-gateway+----------> HTTP service     +---X---> gRPC service    |
         |                |          |                  |      |                 |
         |                |          |                  |      |                 |
         +----------------+          +------------------+      +-----------------+

When it happens, we see that the requests are failing on the "HTTP service" level, with 408 status codes.

Below are some screenshots from the HTTP service dashboard:

[Screenshots: HTTP service dashboard, 2021-06-11 17:27]

And when checking the metrics for the gRPC service over the same time period, we see that these requests never even reach the gRPC service pods. This suggests the requests are still being routed to the deleted pods, probably over the old TCP connection.

Affected product area
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
[ ] Upgrade

Expected behavior
When a node is preempted, a new TCP connection should be used for subsequent requests and no requests should be routed to the dead pod(s).

Steps to reproduce the bug
First, I send a request to check the health of the gRPC service:

curl  https://api.redacted.com/status/health\?service\=grpc-service
{"status":"OK"}%

I use the gcloud command below to simulate a preemption for a specific cluster node:

gcloud compute instances simulate-maintenance-event ...

I have 2 replicas of the gRPC service; below you can see the endpoints:

NAME                   ENDPOINTS                          AGE
grpc-service   10.15.1.15:5000,10.15.1.199:5000   102d

And only 1 replica of the HTTP service. Below is the output of the istioctl proxy-config command:

istioctl proxy-config endpoints http-pod --cluster "outbound|5000|v1|grpc-service.ns.svc.cluster.local"
ENDPOINT             STATUS      OUTLIER CHECK     CLUSTER
10.15.1.15:5000      HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local
10.15.1.199:5000     HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local

Simulating node preemption

Here, I am going to simulate the preemption of one of the nodes running the gRPC service pods.

This time, all of the requests sent by curl are failing, even after the new endpoint was added to the k8s endpoints:

curl  https://api.redacted.com/status/health\?service\=grpc-service
{"code":"request_timeout","details":[],"message":"The request has timed out."}

Although the endpoints are updated after a couple of seconds, that does not prevent the issue from happening and lasting around 15 minutes.

Below is the output of kubectl get ep:

k -n ns get ep grpc-service -w
NAME                   ENDPOINTS                          AGE
grpc-service   10.15.1.15:5000,10.15.1.199:5000   102d
grpc-service   10.15.1.15:5000                    102d
grpc-service   10.15.1.15:5000                    102d
grpc-service   10.15.0.134:5000,10.15.1.15:5000   102d

And the output of the istioctl proxy-config command:

# before the preemption
istioctl proxy-config endpoints http-pod --cluster "outbound|5000|v1|grpc-service.ns.svc.cluster.local"
ENDPOINT             STATUS      OUTLIER CHECK     CLUSTER
10.15.1.15:5000      HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local
10.15.1.199:5000     HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local

# endpoint removed
ENDPOINT            STATUS      OUTLIER CHECK     CLUSTER
10.15.1.15:5000     HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local

# new endpoint added
ENDPOINT             STATUS      OUTLIER CHECK     CLUSTER
10.15.0.134:5000     HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local
10.15.1.15:5000      HEALTHY     OK                outbound|5000|v1|grpc-service.ns.svc.cluster.local

Below are the logs from the istio-proxy container of the HTTP service pod:

[2021-06-15T13:49:11.149Z] "GET /status/health?service=grpc-service HTTP/2" 408 - via_upstream - "-" 0 78 1004 1003 "10.14.0.56" "curl/7.64.1" "76e17e28-03f9-4b27-99b7-aca1e6b42b5e" "api.redacted.com" "127.0.0.1:3000" inbound|3000|| 127.0.0.1:41928 10.15.2.13:3000 10.14.0.56:0 outbound_.3000_.v1_.http-service.ns.svc.cluster.local default
[2021-06-15T13:49:12.162Z] "- - -" 0 - - - "-" 1560 3928 9 - "-" "-" "-" "-" "X.X.X.X:443" PassthroughCluster 10.15.2.13:58358 X.X.X.X:443 10.15.2.13:58356 - -
[2021-06-15T13:50:44.321Z] "GET /status/health?service=grpc-service HTTP/2" 408 - via_upstream - "-" 0 78 1004 1004 "10.14.0.15" "curl/7.64.1" "4679dc33-0ae7-933a-b356-7283af06e8e4" "api.redacted.com" "127.0.0.1:3000" inbound|3000|| 127.0.0.1:44072 10.15.2.13:3000 10.14.0.15:0 outbound_.3000_.v1_.http-service.ns.svc.cluster.local default

As for the new gRPC service pod, it is not receiving any traffic at all. This is the last log line from the istio-proxy container:

{"level":"info","time":"2021-06-15T13:46:56.395111Z","msg":"Envoy proxy is ready"}

I noticed these log messages on the other (surviving) grpc-service istio-proxy container though:

{"level":"warning","time":"2021-06-15T13:45:25.750788Z","scope":"envoy config","msg":"StreamAggregatedResources gRPC config stream closed: 0, "}
{"level":"warning","time":"2021-06-15T13:45:25.865083Z","scope":"envoy config","msg":"StreamAggregatedResources gRPC config stream closed: 14, connection error: desc = "transport: Error while dialing dial tcp 10.16.15.208:15012: connect: connection refused""}
{"level":"info","time":"2021-06-15T13:45:26.848565Z","scope":"xdsproxy","msg":"connected to upstream XDS server: istiod.istio-system.svc:15012"}

I am not sure if these are related though.

Also, as a side note, we have been seeing inconsistent behaviour when it comes to load balancing for inter-service communication (from http-service to grpc-service). As I explained above, all of the requests were failing after triggering the node preemption, even though one grpc-service pod was still running. We can also see this by tailing the logs of both pods: the requests hit only one of the running replicas.
We're using ROUND_ROBIN simple load balancing in the destination rule for the service, so we would expect only around 50% of the requests to fail. However, sometimes no requests fail at all, and sometimes all of them do.
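For reference, the destination rule for the gRPC service looks roughly like the sketch below (names and labels are placeholders; the v1 subset matches the subset visible in the cluster name outbound|5000|v1|grpc-service.ns.svc.cluster.local above):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: grpc-service
  namespace: ns
spec:
  host: grpc-service.ns.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN   # requests should alternate between the two replicas
  subsets:
  - name: v1
    labels:
      version: v1           # assumed pod label backing the "v1" subset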

Version
Istio

client version: 1.9.0
control plane version: 1.9.1
data plane version: 1.9.1 (54 proxies)

Kubernetes

Client Version: v1.20.4
Server Version: v1.20.6-gke.1400

Helm

v3.5.2+g167aac7

How was Istio installed?
The bug is reproducible in 2 different environments. In one environment, Istio was installed using the Helm chart; in the other, using the operator.

Environment where the bug was observed (cloud vendor, OS, etc)
We use GKE
OS: Google's Container-Optimized OS with Docker (cos)

Please let me know if you need any more info.

mehyedes (Author) commented:

Update:
I have disabled mTLS on one namespace in order to get more visibility, and here is what I captured with tcpdump, where:

  • 10.16.0.227 is the IP address of the gRPC service
  • 10.15.2.145 is the IP address of the HTTP service

The packets in the Wireshark screenshot are all part of the same TCP stream.
After starting the capture, I sent 3 successful requests to the grpc service through the http service.
I also noticed that only one of the 2 pods of the grpc service was receiving the requests. So I triggered a preemption of the node where that grpc service pod was running, and requests started timing out.
This shows up in the capture as all of those TCP retransmissions. After that, nothing more happens: the client does not even attempt to establish a new connection to the other, healthy pod.
[Screenshot: Wireshark capture showing the TCP retransmissions, 2021-06-17 13:17]
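The capture itself can be reproduced with something along these lines (illustrative only, not necessarily the exact command used): run on the node, or from a container sharing the pod's network namespace, filtering on the two pod IPs and the gRPC port 5000.

# capture traffic between the HTTP service pod and the gRPC service pod
tcpdump -i any -n 'host 10.15.2.145 and host 10.16.0.227 and port 5000' -w grpc-preemption.pcap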

mehyedes (Author) commented:

OK, after further investigation, I found out that this is related to how long the kernel keeps retransmitting unacknowledged data before giving up (the behaviour that the TCP_USER_TIMEOUT socket option can override), which works out to around 15 minutes by default on our Kubernetes worker nodes:

$ sysctl -a | grep tcp_retries2
net.ipv4.tcp_retries2 = 15

The number of retries the kernel is willing to make before giving up on a connection is governed by net.ipv4.tcp_retries2. The timeout between retries is calculated dynamically based on the RTT of the connection, but with the default of 15 retries it adds up to around 15-20 minutes in total.
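As an illustration only (not something we changed, and it affects every TCP connection on the host), lowering tcp_retries2 at the node level shortens that window considerably:

# e.g. 8 retries give up after roughly 100 seconds instead of ~15 minutes
sysctl -w net.ipv4.tcp_retries2=8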

We had to set the grpc.keepalive_timeout_ms value on the client application (the http-service). When we tried to reproduce the issue this time, we noticed that after 3s (the timeout we set) the client application established a new connection, and we saw no more failing requests. Below is a screenshot that displays that behaviour:

[Screenshot: Wireshark capture, 2021-06-18; retransmissions stop after ~3s]

As you can see, the TCP retransmissions lasted about 3s, the value we set for the keepalive timeout.

What I don't understand is that although we had already tried setting TCP keepalive with Istio before, both mesh-wide and in destination rules (roughly the configuration sketched below), it did not prevent the issue for us. The only way to overcome the problem was to configure the client app.
I thought it would have the same effect, but maybe I am missing something here. I would appreciate it if someone from the Istio team could explain that behaviour 🤔
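For reference, "setting TCP keepalive in destination rules" means something along these lines (a sketch with placeholder names and values, not our exact configuration):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: grpc-service-keepalive
  namespace: ns
spec:
  host: grpc-service.ns.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          # probe an idle connection after 10s, then every 5s, up to 3 times
          time: 10s
          interval: 5s
          probes: 3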

For anyone who ends up here with a similar issue, check the resources below for more info:
https://www.hostedgraphite.com/blog/deadlines-lies-and-videotape-the-tale-of-a-grpc-bug
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
https://github.com/grpc/proposal/blob/master/A18-tcp-user-timeout.md

Keeping this issue open for now, in case someone from the istio team can chime in and add more insights.

rittneje commented Jun 18, 2021:

What I don't understand here, is that although we had already tried setting TCPKeepalive with Istio before, both mesh wide, and also in destination rules, it did not prevent the issue from happening for us.

Unfortunately, TCP keepalive is widely misunderstood. Its primary purpose is NOT to detect half-closed connections, despite the fact you will find a myriad of sources making this claim. Its real purpose is to keep an otherwise idle connection "alive" (hence the name) so that routers and firewalls do not drop it due to inactivity. That these keepalive probes can detect a half-closed connection is more of a side effect.

If you are sending application-level data through the TCP socket, that counts as activity, even if it isn't receiving ACKs (and thus is stuck in a TCP retry loop). In this situation, TCP keepalives will NEVER be sent, and the connection will not be considered broken until you reach the OS-defined TCP retry timeout.

For this reason, it is essential that you always include some application-level ping to detect half-closed connections. It is perfectly fine to do this in conjunction with TCP keepalives, but you must not ever rely solely on TCP keepalives.

mehyedes (Author) commented:

Thank you for the explanation @rittneje! The difference between the two wasn't clear to me.

istio-policy-bot added the lifecycle/stale label on Sep 14, 2021
rittneje commented:

@howardjohn Can you remove the stale label?

kebe7jun (Member) commented:

not stale

istio-policy-bot removed the lifecycle/stale label on Sep 16, 2021
istio-policy-bot added the lifecycle/stale label on Dec 15, 2021
rittneje commented:

@howardjohn @kebe7jun Please remove the stale label.

istio-policy-bot commented:

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-09-16. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

adams-shaun (Contributor) commented:

@mehyedes

There is another option when you don't have the ability to modify the client app: you can force Envoy to open the socket with the TCP_USER_TIMEOUT option, which will make the kernel give up on unacknowledged data much sooner. There is a good write-up on this here: https://www.evanjones.ca/tcp-connection-timeouts.html

I should also note that retransmission behavior varies from platform to platform. We've experienced similar issues with OpenShift clusters, but EKS does not seem to be affected in the same way; it could be some logic related to the tail loss probe.

For those of us without control of client behavior, it is possible to work around this like so:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sampleoptions
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
      cluster:
        name: "outbound|12345||foo.ns.svc.cluster.local"
    patch:
      operation: MERGE
      value:
        upstream_bind_config:
          source_address:
            address: "0.0.0.0"
            port_value: 0
            protocol: TCP
          socket_options:
          # TCP_USER_TIMEOUT: treat unacknowledged data as a dead connection after 10000 ms
          - name: 18          # TCP_USER_TIMEOUT
            int_value: 10000  # timeout in milliseconds
            level: 6          # IPPROTO_TCP / SOL_TCP

labilezhu commented:

I got a lot of useful info from this issue, thanks. But I have a small question:

I think what you are talking about is TCP half-open connections, not half-closed connections. I am sorry, maybe I am just a nerd.

What I'm referring to:
