Socket hang up in AWS VPC #36499

Closed
TsvetanMilanov opened this issue Dec 13, 2021 · 9 comments · Fixed by #36532

Comments

@TsvetanMilanov

TsvetanMilanov commented Dec 13, 2021

Bug Description

I have a Node.js application deployed in an AWS EKS cluster. After deploying Istio in the cluster, I noticed sporadic socket hang up errors when making outgoing https requests from the Node.js application with keep alive enabled in the https.Agent for those requests. There is no issue with http requests. The application is registered in Istio and all outgoing traffic goes through the istio-proxy.
I managed to create a very simple script which reproduces the issue consistently:

const https = require("https");
// keep-alive agent so that the second request reuses the socket from the first
const agent = new https.Agent({ keepAlive: true, maxSockets: 50 });
const log = (data) => console.log(`[${new Date()}] ${data}`);
const get = async () => {
    return await new Promise((resolve, reject) => {
        log("starting request");
        const req = https.request({ method: "GET", host: "ipv4.icanhazip.com", port: 443, agent }, (res) => {
            res.on("error", reject);
            let body = '';
            res.on("data", (chunk) => { body += chunk.toString(); });
            res.on("end", () => { log(body); resolve() });
        });

        req.on("socket", (s) => {
            s.on("connect", () => {
                log(`socket: ${s.localAddress}:${s.localPort}`);
                log(`remote: ${s.remoteAddress}:${s.remotePort}`);
            });
        });

        req.on("error", reject);
        req.end();
    });
};

const main = async () => {
    await get();
    // wait ~6 minutes before sending a second request over the kept-alive socket
    await new Promise((resolve) => setTimeout(resolve, 6 * 60 * 1000));
    await get();
};

main().then(log).catch(log);

Steps to reproduce:

  1. Deploy Istio using the Istio operator in EKS
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: repro
  namespace: istio-system
spec:
  hub: gcr.io/istio-release
  profile: default
  tag: 1.12.1
  meshConfig:
    accessLogFile: /dev/stdout
    connectTimeout: 30s
    defaultConfig:
      interceptionMode: TPROXY
      terminationDrainDuration: 30s
  2. Deploy the repro script in EKS:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-repro
  labels:
    app: istio-repro
    sidecar.istio.io/inject: "true"
spec:
  selector:
    matchLabels:
      app: istio-repro
  template:
    metadata:
      labels:
        app: istio-repro
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: repro
          image: tmilanov/istio-socket-hang-up-repro:0.1.0
          imagePullPolicy: Always
          command:
            - sleep
            - "1000000000"
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
  3. Execute the repro script:
kubectl exec $(kubectl get po | grep istio-repro | awk '{print $1}') -c repro -- node repro.js

I tested the steps above in AWS EKS and in a Kubernetes cluster deployed locally using Kind. I reproduced the issue in AWS, but I was not able to reproduce it in the local Kubernetes cluster.

I wrote the same script in Golang to check whether the problem is specific to Node.js, and I reproduced the issue with the Golang script in AWS as well (you can run the Golang script with kubectl exec $(kubectl get po | grep istio-repro | awk '{print $1}') -c repro -- go run repro.go).

I tested on EC2 instances with different Linux-based operating systems (Bottlerocket and Ubuntu) using a custom docker image - tmilanov/istio-socket-hang-up-troubleshoot:0.1.0. The image contains the repro scripts, envoy with a config that contains only the PassthroughCluster and BlackHoleCluster configuration which Istio generates, and an iptables script which applies the outbound iptables configuration which Istio applies in the istio-init container. I reproduced the issue on both operating systems with this image.

I also tested on EC2 instances deployed in the public subnets of my VPC and did not reproduce the issue. All other tests which reproduced the issue were executed from instances in my private subnets, which use the VPC NAT gateway when making requests to the internet.

I don't have a lot of experience with iptables and envoy, and I'm not sure whether the image I created for testing directly on EC2 instances contains the exact config which Istio generates, but if it can help the investigation, here are the steps to reproduce using the custom docker image:

  1. docker run -ti --cap-add=NET_ADMIN tmilanov/istio-socket-hang-up-troubleshoot:0.1.0 bash
  2. ./iptables.sh
  3. ./start-envoy.sh
  4. node repro.js

My workaround is to use traffic.sidecar.istio.io/excludeOutboundPorts and exclude port 443, but I'd appreciate it if someone with more experience could take a look at this issue.
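For completeness, a minimal sketch of how the workaround looks in the repro Deployment's pod template (only the relevant part of the Deployment from the steps above is shown; the excluded port list is specific to my case):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-repro
spec:
  template:
    metadata:
      annotations:
        # skip sidecar interception for outbound traffic to port 443
        traffic.sidecar.istio.io/excludeOutboundPorts: "443"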

Version

istioctl version:
client version: 1.11.3
control plane version: 1.12.1
data plane version: 1.12.1 (5 proxies)

kubectl version --short:
Client Version: v1.20.0
Server Version: v1.20.7-eks-d88609

Additional Information

No response

@bianpengyuan
Contributor

Collecting a trace-level envoy log when this happens would help to find out the issue: https://github.com/istio/istio/wiki/Troubleshooting-Istio#collecting-information-2

@TsvetanMilanov
Author

Here is the full trace log started before the first request and stopped after the second request fails -> https://gist.github.com/TsvetanMilanov/8958a747aa19a3f06d35bf791f3ebaf3
The first request starts at 08:54:10 UTC
The second request starts at 09:00:10 UTC
The destination is https://ipv4.icanhazip.com (104.18.114.97)

I noticed the following error: "read error: Connection reset by peer", but I'm not sure whether the peer is the VPC NAT, the ipv4.icanhazip.com server, iptables, or something else.

Here is the clusters config -> https://gist.github.com/TsvetanMilanov/3304f63bd37c2655cbfe11974cd8b378
Here are the stats after the second request fails -> https://gist.github.com/TsvetanMilanov/0d21f10091a7b8892fb2dae72224a309

@bianpengyuan
Contributor

C52 is the connection from envoy to the upstream (104.18.114.97), and it looks like it probably hits some sort of idle timeout (I don't see any activity between 08:54 and 09:00). So the AWS VPC NAT gateway probably has some configuration for that? Not sure why keep-alive is not kicking in.

@TsvetanMilanov
Author

I found the following in the AWS docs:

If a connection that's using a NAT gateway is idle for 350 seconds or more, the connection times out.

When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).

To prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can enable TCP keepalive on the instance with a value less than 350 seconds.

This seems to cause the issue, but somehow it doesn't affect http requests or requests which go directly to the NAT.

@bianpengyuan
Contributor

bianpengyuan commented Dec 15, 2021

Just learned from @lambdai that keepalive actually won't be passed through envoy; TCP keepalive is kernel to kernel. You will need to configure a destination rule for your external service: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-TCPSettings, or do it mesh-wide: https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig
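For example, a minimal sketch for the host used in the repro above (resource names and keepalive values are illustrative; the timers just need to fire before the NAT gateway's 350-second idle timeout):

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: icanhazip
spec:
  hosts:
    - ipv4.icanhazip.com
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: icanhazip-keepalive
spec:
  host: ipv4.icanhazip.com
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          # start probing well before the 350s NAT gateway idle timeout
          time: 300s
          interval: 60s
          probes: 3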

@TsvetanMilanov
Author

Creating a ServiceEntry + DestinationRule fixed the issue. Adding tcpKeepalive in the meshConfig didn't help.
Unfortunately, creating a service entry and destination rule for each external service and port won't work in my case, because the external services and ports are not static; they are provided by the users of the application.

@bianpengyuan
Contributor

Creating a ServiceEntry + DestinationRule fixed the issue. Adding tcpKeepalive in the meshConfig didn't help. Unfortunately, creating a service entry and destination rule for each external service and port won't work in my case, because the external services and ports are not static; they are provided by the users of the application.

This is actually a bug; it will be fixed by #36532

@TsvetanMilanov
Author

I tested with 1.13-alpha.90282c9ff5c1c7dfae2c9f15d2a38bcd15b9aa2a, and setting tcpKeepalive in the meshConfig now works and fixes the issue.
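For reference, a minimal sketch of what the mesh-wide setting looks like in the IstioOperator spec (keepalive values are illustrative, not necessarily the ones I used; they just need to fire before the NAT gateway's 350s idle timeout):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: repro
  namespace: istio-system
spec:
  meshConfig:
    # mesh-wide TCP keepalive for sidecar-to-upstream connections
    tcpKeepalive:
      time: 300s
      interval: 60s
      probes: 3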

@johnzheng1975
Member

Excellent investigation and testing. @TsvetanMilanov
Excellent fix. @bianpengyuan
