Socket hang up in AWS VPC #36499
Comments
Collecting trace-level Envoy logs when this happens would help identify the issue: https://github.com/istio/istio/wiki/Troubleshooting-Istio#collecting-information-2
Here is the full trace log, started before the first request and stopped after the second request fails: https://gist.github.com/TsvetanMilanov/8958a747aa19a3f06d35bf791f3ebaf3
I noticed the following error
Here is the clusters config: https://gist.github.com/TsvetanMilanov/3304f63bd37c2655cbfe11974cd8b378
I found the following in the AWS docs:
This seems to cause the issue, but somehow it doesn't affect HTTP requests or requests which go directly to the NAT.
Just learned from @lambdai that keepalive actually won't be passed through Envoy. It is kernel to kernel for TCP. You will need to configure a DestinationRule for your external service: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-TCPSettings, or do it mesh-wide: https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig
Creating ServiceEntry + DestinationRule fixed the issue. Adding tcpKeepalive in the meshConfig didn't help.
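For readers following along, a ServiceEntry plus DestinationRule of the kind described here might look roughly like the sketch below. This is not the configuration actually applied in this thread: the host `api.example.com`, the resource names, and the keepalive timings are hypothetical placeholders. The `time` value should presumably be set lower than whatever idle timeout applies on the network path to the external service.

```yaml
# Hypothetical sketch: host, names, and timings are placeholders,
# not values taken from this issue.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
  - api.example.com          # assumed external HTTPS endpoint
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api-keepalive
spec:
  host: api.example.com      # must match the ServiceEntry host
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 300s         # idle time before the first probe (illustrative)
          interval: 75s      # time between probes (illustrative)
```

With this in place the sidecar enables TCP keepalive on its own upstream connections to that host, which is needed because the application's keepalive setting only covers the connection between the application and the sidecar.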
This is actually a bug; it will be fixed by #36532.
I tested with 1.13-alpha.90282c9ff5c1c7dfae2c9f15d2a38bcd15b9aa2a; setting tcpKeepalive in the meshConfig now works and fixes the issue.
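As a rough illustration, a mesh-wide setting might look like the fragment below. It assumes Istio is installed through an IstioOperator resource, which this thread does not state, and the timings are illustrative placeholders rather than the values used in the test. Per the comments above, it is only expected to take effect on builds that include the fix from #36532.

```yaml
# Hypothetical sketch: assumes an IstioOperator-based install;
# timings are placeholders, not values from this issue.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    tcpKeepalive:
      time: 300s       # idle time before the first probe (illustrative)
      interval: 75s    # time between probes (illustrative)
```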
Excellent investigation and testing, @TsvetanMilanov.
Bug Description
I have a Node.js application deployed in an AWS EKS cluster. After deploying Istio in the cluster, I noticed sporadic `socket hang up` errors when making outgoing HTTPS requests from the Node.js application with keep-alive enabled in the https.Agent for those requests. There is no issue with HTTP requests. The application is registered in Istio and all outgoing traffic goes through the istio-proxy.
I managed to create a very simple script which reproduces the issue consistently:
Steps to reproduce:
I tested with the steps above in AWS EKS and in a Kubernetes cluster deployed locally using Kind. I reproduced the issue in AWS, but I was not able to reproduce it in the local Kubernetes cluster.
I wrote the same script in Golang to check whether the problem is specific to Node.js, and I reproduced the issue with the Golang script in AWS (you can run the Golang script with `kubectl exec $(kubectl get po | grep istio-repro | awk '{print $1}') -c repro -- go run repro.go`).
I tested on EC2 instances with different Linux-based operating systems (Bottlerocket and Ubuntu) using a custom docker image, `tmilanov/istio-socket-hang-up-troubleshoot:0.1.0`. The image contains the scripts for reproducing the issue, envoy with a config that contains only the PassthroughCluster and BlackHoleCluster config which Istio generates, and an iptables script which applies the output iptables config that Istio applies in the istio-init container. I reproduced the issue on both operating systems with this image.
I also tested on EC2 instances deployed in the public subnets of my VPC and I did not reproduce the issue. All other tests which reproduced the issue were executed from instances in my private subnets, which use a VPC NAT gateway when making requests to the internet.
I don't have a lot of experience with iptables and envoy, and I'm not sure whether the image I created to test directly on EC2 instances contains the correct config which Istio generates, but in case it helps the investigation, here are the steps to reproduce using the custom docker image:
docker run -ti --cap-add=NET_ADMIN tmilanov/istio-socket-hang-up-troubleshoot:0.1.0 bash
./iptables.sh
./start-envoy.sh
node repro.js
My workaround is to use `traffic.sidecar.istio.io/excludeOutboundPorts` and exclude port 443, but I'd appreciate it if someone with more experience could take a look at this issue.
Version
Additional Information
No response