Socket hang up in AWS VPC #36499

Closed
TsvetanMilanov opened this issue Dec 13, 2021 · 9 comments · Fixed by #36532

Comments

@TsvetanMilanov

TsvetanMilanov commented Dec 13, 2021

Bug Description

I have a Node.js application deployed in an AWS EKS cluster. After deploying Istio in the cluster, I noticed sporadic socket hang up errors when making outgoing https requests from the Node.js application with keep alive enabled in the https.Agent for those requests. There is no issue with http requests. The application is registered in Istio and all outgoing traffic goes through the istio-proxy.
I managed to create a very simple script which reproduces the issue consistently:

const https = require("https");
// keep-alive agent so that the second request reuses the socket from the first
const agent = new https.Agent({ keepAlive: true, maxSockets: 50 });
const log = (data) => console.log(`[${new Date()}] ${data}`);
const get = async () => {
    return await new Promise((resolve, reject) => {
        log("starting request");
        const req = https.request({ method: "GET", host: "ipv4.icanhazip.com", port: 443, agent }, (res) => {
            res.on("error", reject);
            let body = '';
            res.on("data", (chunk) => { body += chunk.toString(); });
            res.on("end", () => { log(body); resolve() });
        });

        req.on("socket", (s) => {
            s.on("connect", () => {
                log(`socket: ${s.localAddress}:${s.localPort}`);
                log(`remote: ${s.remoteAddress}:${s.remotePort}`);
            });
        });

        req.on("error", reject);
        req.end();
    });
};

const main = async () => {
    await get();
    // wait ~6 minutes before sending a second request over the kept-alive socket
    await new Promise((resolve) => setTimeout(resolve, 6 * 60 * 1000));
    await get();
};

main().then(log).catch(log);

Steps to reproduce:

  1. Deploy Istio using the Istio operator in EKS
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: repro
  namespace: istio-system
spec:
  hub: gcr.io/istio-release
  profile: default
  tag: 1.12.1
  meshConfig:
    accessLogFile: /dev/stdout
    connectTimeout: 30s
    defaultConfig:
      interceptionMode: TPROXY
      terminationDrainDuration: 30s
  2. Deploy the repro script in EKS:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-repro
  labels:
    app: istio-repro
    sidecar.istio.io/inject: "true"
spec:
  selector:
    matchLabels:
      app: istio-repro
  template:
    metadata:
      labels:
        app: istio-repro
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: repro
          image: tmilanov/istio-socket-hang-up-repro:0.1.0
          imagePullPolicy: Always
          command:
            - sleep
            - "1000000000"
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
  3. Execute the repro script:
kubectl exec $(kubectl get po | grep istio-repro | awk '{print $1}') -c repro -- node repro.js

I tested the steps above in AWS EKS and in a Kubernetes cluster deployed locally using Kind. I reproduced the issue in AWS, but I was not able to reproduce it in the local Kubernetes cluster.

I wrote the same script in Golang to check whether the problem is specific to Node.js, and I reproduced the issue with the Golang script in AWS as well (you can run the Golang script with kubectl exec $(kubectl get po | grep istio-repro | awk '{print $1}') -c repro -- go run repro.go).

I tested on EC2 instances with different Linux-based operating systems (Bottlerocket and Ubuntu) using a custom docker image - tmilanov/istio-socket-hang-up-troubleshoot:0.1.0. The image contains the repro scripts, envoy with a config that contains only the PassthroughCluster and BlackHoleCluster configuration which Istio generates, and an iptables script which applies the outbound iptables configuration which Istio applies in the istio-init container. I reproduced the issue on both operating systems with this image.

I also tested on EC2 instances deployed in the public subnets of my VPC and did not reproduce the issue. All other tests which reproduced the issue were executed from instances in my private subnets, which use the VPC NAT gateway when making requests to the internet.

I don't have a lot of experience with iptables and envoy, and I'm not sure whether the image I created for testing directly on EC2 instances contains the exact config which Istio generates, but if it can help the investigation, here are the steps to reproduce using the custom docker image:

  1. docker run -ti --cap-add=NET_ADMIN tmilanov/istio-socket-hang-up-troubleshoot:0.1.0 bash
  2. ./iptables.sh
  3. ./start-envoy.sh
  4. node repro.js

My workaround is to use traffic.sidecar.istio.io/excludeOutboundPorts and exclude port 443, but I'd appreciate it if someone with more experience could take a look at this issue.
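For completeness, a minimal sketch of how the workaround looks in the repro Deployment's pod template (only the relevant part of the Deployment from the steps above is shown; the excluded port list is specific to my case):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-repro
spec:
  template:
    metadata:
      annotations:
        # skip sidecar interception for outbound traffic to port 443
        traffic.sidecar.istio.io/excludeOutboundPorts: "443"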

Version

istioctl version:
client version: 1.11.3
control plane version: 1.12.1
data plane version: 1.12.1 (5 proxies)

kubectl version --short:
Client Version: v1.20.0
Server Version: v1.20.7-eks-d88609

Additional Information

No response

@bianpengyuan
Contributor

Collecting a trace-level envoy log when this happens would help to find out the issue: https://github.com/istio/istio/wiki/Troubleshooting-Istio#collecting-information-2

@TsvetanMilanov
Author

Here is the full trace log started before the first request and stopped after the second request fails -> https://gist.github.com/TsvetanMilanov/8958a747aa19a3f06d35bf791f3ebaf3
The first request starts at 08:54:10 UTC
The second request starts at 09:00:10 UTC
The destination is https://ipv4.icanhazip.com (104.18.114.97)

I noticed the following error: "read error: Connection reset by peer", but I'm not sure whether the peer is the VPC NAT, the ipv4.icanhazip.com server, iptables, or something else.

Here is the clusters config -> https://gist.github.com/TsvetanMilanov/3304f63bd37c2655cbfe11974cd8b378
Here are the stats after the second request fails -> https://gist.github.com/TsvetanMilanov/0d21f10091a7b8892fb2dae72224a309

@bianpengyuan
Contributor

C52 is the connection from envoy to the upstream (104.18.114.97), and it looks like it probably hits some sort of idle timeout (I don't see any activity between 08:54 and 09:00). So the AWS VPC NAT gateway probably has some configuration for that? Not sure why keep-alive is not kicking in.

@TsvetanMilanov
Author

I found the following in the AWS docs:

If a connection that's using a NAT gateway is idle for 350 seconds or more, the connection times out.

When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).

To prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can enable TCP keepalive on the instance with a value less than 350 seconds.

This seems to cause the issue, but somehow it doesn't affect http requests or requests which go directly to the NAT.

@bianpengyuan
Contributor

bianpengyuan commented Dec 15, 2021

Just learned from @lambdai that keepalive actually won't be passed through envoy; TCP keepalive is kernel to kernel. You will need to configure a destination rule for your external service: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-TCPSettings, or do it mesh-wide: https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig
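For example, a minimal sketch for the host used in the repro above (resource names and keepalive values are illustrative; the timers just need to fire before the NAT gateway's 350-second idle timeout):

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: icanhazip
spec:
  hosts:
    - ipv4.icanhazip.com
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: icanhazip-keepalive
spec:
  host: ipv4.icanhazip.com
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          # start probing well before the 350s NAT gateway idle timeout
          time: 300s
          interval: 60s
          probes: 3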

@TsvetanMilanov
Author

Creating a ServiceEntry + DestinationRule fixed the issue. Adding tcpKeepalive in the meshConfig didn't help.
Unfortunately, creating a service entry and destination rule for each external service and port won't work in my case, because the external services and ports are not static; they are provided by the users of the application.

@bianpengyuan
Contributor

Creating a ServiceEntry + DestinationRule fixed the issue. Adding tcpKeepalive in the meshConfig didn't help. Unfortunately, creating a service entry and destination rule for each external service and port won't work in my case, because the external services and ports are not static; they are provided by the users of the application.

This is actually a bug; it will be fixed by #36532

@TsvetanMilanov
Author

I tested with 1.13-alpha.90282c9ff5c1c7dfae2c9f15d2a38bcd15b9aa2a, and setting tcpKeepalive in the meshConfig now works and fixes the issue.
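For reference, a minimal sketch of what the mesh-wide setting looks like in the IstioOperator spec (keepalive values are illustrative, not necessarily the ones I used; they just need to fire before the NAT gateway's 350s idle timeout):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: repro
  namespace: istio-system
spec:
  meshConfig:
    # mesh-wide TCP keepalive for sidecar-to-upstream connections
    tcpKeepalive:
      time: 300s
      interval: 60s
      probes: 3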

@johnzheng1975
Member

Excellent investigation and testing. @TsvetanMilanov
Excellent fix. @bianpengyuan
