
Intermittent broken pipe #24387

Closed
nabbott2008 opened this issue Jun 3, 2020 · 24 comments
Labels: area/networking, kind/need more info, lifecycle/automatically-closed, lifecycle/stale

Comments

@nabbott2008

nabbott2008 commented Jun 3, 2020

Bug description
We are seeing
[Envoy (Epoch 0)] [2020-05-27 20:35:08.309][32][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:452] [C1340] idle timeout
in istio-proxy logs for TCP connections.
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
As this is a TCP connection, we shouldn't be seeing these 60m timeouts from what I can see in the documentation; it looks as though the connection is being treated as an HTTP connection.
We recently upgraded from Istio 1.3.6, where we weren't seeing these issues.
Steps to reproduce the bug
Here is the service we're testing against:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  creationTimestamp: "2020-05-26T07:08:07Z"
  labels:
    app: abc-opensource
    chart: abc-opensource-0.29.0
    heritage: Tiller
    product: redis
    release: abc-redis
  name: abc-redis-opensource-announce-0
  namespace: abc
  resourceVersion: "29305795"
  selfLink: /api/v1/namespaces/abc/services/abc-redis-opensource-announce-0
  uid: e3629883-79fe-4304-8cb1-9b544ed152ad
spec:
  clusterIP:  1.2.3.4
  ports:
  - name: tcp-server
    port: 6379
    protocol: TCP
    targetPort: redis
  - name: tcp-sentinel
    port: 26379
    protocol: TCP
    targetPort: sentinel
  publishNotReadyAddresses: true
  selector:
    app: abc-opensource
    release: abc-redis
    statefulset.kubernetes.io/pod-name: abc-redis-opensource-server-0
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

client version: 1.5.4
egressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway-public version:
pilot version: 1.5.4
data plane version: 1.5.1 (32 proxies), 1.5.4 (173 proxies)
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.9", GitCommit:"a17149e1a189050796ced469dbd78d380f2ed5ef", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:51Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Client: &version.Version{SemVer:"v2.16.7", GitCommit:"5f2584fd3d35552c4af26036f0c464191287986b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.6", GitCommit:"dd2e5695da88625b190e6b22e9542550ab503a47", GitTreeState:"clean"}

The client uses Jedis. We see the issue when the client is disconnected from Sentinel:
Lost connection to Sentinel at abc-redis-opensource-announce-0:26379. Sleeping 5000ms and retrying.
The issue that is causing us the most pain, though, is when we ask the Redis client to make a new request. Jedis creates a connection to Redis when it starts, and this connection is then timed out by Istio. As a result, when any of our apps makes a new request through Jedis, it no longer has an established connection to Redis and needs to create a new one. We see

redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Broken pipe (Write failed)","rootCause":"Broken pipe (Write failed)"}]}}

We are seeing similar issues between other apps but are finding those harder to replicate.
We weren't seeing these issues on 1.3.6, and we're still not seeing them on other clusters running that version.
How was Istio installed?
Using operator.
Environment where bug was observed (cloud vendor, OS, etc)
EKS

@christian-posta
Contributor

Could you paste the JSON output for the listener in question? You can use istioctl proxy-config listeners to get that.
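
For example (pod name and namespace below are placeholders):

istioctl proxy-config listeners <pod-name> -n <namespace> --port 26379 -o json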

@nabbott2008
Author

nabbott2008 commented Jun 25, 2020

[
    {
        "name": "10.100.230.235_26379",
        "address": {
            "socketAddress": {
                "address": "10.100.230.235",
                "portValue": 26379
            }
        },
        "filterChains": [
            {
                "filters": [
                    {
                        "name": "istio.stats",
                        "typedConfig": {
                            "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
                            "typeUrl": "type.googleapis.com/envoy.extensions.filters.network.wasm.v3.Wasm",
                            "value": {
                                "config": {
                                    "configuration": "{\n  \"debug\": \"false\",\n  \"stat_prefix\": \"istio\"\n}\n",
                                    "root_id": "stats_outbound",
                                    "vm_config": {
                                        "code": {
                                            "local": {
                                                "inline_string": "envoy.wasm.stats"
                                            }
                                        },
                                        "runtime": "envoy.wasm.runtime.null",
                                        "vm_id": "tcp_stats_outbound"
                                    }
                                }
                            }
                        }
                    },
                    {
                        "name": "envoy.tcp_proxy",
                        "typedConfig": {
                            "@type": "type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy",
                            "statPrefix": "outbound|26379||dds-redis-opensource-announce-0.abc.svc.cluster.local",
                            "cluster": "outbound|26379||dds-redis-opensource-announce-0.abc.svc.cluster.local",
                            "accessLog": [
                                {
                                    "name": "envoy.file_access_log",
                                    "typedConfig": {
                                        "@type": "type.googleapis.com/envoy.config.accesslog.v2.FileAccessLog",
                                        "path": "/dev/stdout",
                                        "format": "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% \"%DYNAMIC_METADATA(istio.mixer:status)%\" \"%UPSTREAM_TRANSPORT_FAILURE_REASON%\" %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME% %ROUTE_NAME%\n"
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        ],
        "deprecatedV1": {
            "bindToPort": false
        },
        "trafficDirection": "OUTBOUND"
    }
]

Apologies, I made a mistake in logging the issue. What we're actually seeing is this TCP timeout:

2020-06-23 14:32:55.339 [Envoy (Epoch 0)] [2020-06-23 14:32:55.280][35][debug][conn_handler] [external/envoy/source/server/connection_handler_impl.cc:353] [C403] new connection
2020-06-23 14:32:55.339 [Envoy (Epoch 0)] [2020-06-23 14:32:55.280][35][debug][filter] [external/envoy/source/common/tcp_proxy/tcp_proxy.cc:381] [C403] Creating connection to cluster outbound|26379||dds-redis-opensource-announce-0.abc.svc.cluster.local
2020-06-23 14:32:55.339 [Envoy (Epoch 0)] [2020-06-23 14:32:55.280][35][debug][filter] [external/envoy/source/common/tcp_proxy/tcp_proxy.cc:232] [C403] new tcp proxy session
2020-06-23 15:32:56.133 [Envoy (Epoch 0)] [2020-06-23 15:32:55.284][35][debug][conn_handler] [external/envoy/source/server/connection_handler_impl.cc:86] [C403] adding to cleanup list
2020-06-23 15:32:56.133 [Envoy (Epoch 0)] [2020-06-23 15:32:55.284][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:192] [C403] closing socket: 1
2020-06-23 15:32:56.133 [Envoy (Epoch 0)] [2020-06-23 15:32:55.284][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C403] closing data_to_write=0 type=1
2020-06-23 15:32:56.133 [Envoy (Epoch 0)] [2020-06-23 15:32:55.284][35][debug][filter] [external/envoy/source/common/tcp_proxy/tcp_proxy.cc:597] [C403] Session timed out
We can't find out how to change this hourly timeout; we can't see in the documentation how to do this.

@christian-posta
Contributor

You can try setting the ISTIO_META_IDLE_TIMEOUT env variable to a duration (e.g. 24h0m0s) on the istio-proxy for this workload, and that should do the trick. Let me know how that works for you.
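
For reference, since this cluster was installed with the operator, the same variable can also be set mesh-wide rather than per workload. A minimal sketch (field names per the IstioOperator meshConfig/defaultConfig API; the value here is only an example, not a recommendation):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      # extra environment variables injected into every sidecar's istio-proxy
      proxyMetadata:
        ISTIO_META_IDLE_TIMEOUT: "24h0m0s"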

@nabbott2008
Author

nabbott2008 commented Jun 29, 2020

Hi,

Thanks for the response. I added the following annotation:

        proxy.istio.io/config: |-
          proxyMetadata:
            ISTIO_META_IDLE_TIMEOUT: 24h0m0s

I tried adding this to the client and server in this scenario and it didn't stop the timeouts.

We have also tried adding this EnvoyFilter:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  creationTimestamp: "2020-06-29T09:24:30Z"
  generation: 1
  name: redis-test2
  namespace: abc
  resourceVersion: "54665918"
  selfLink: /apis/networking.istio.io/v1alpha3/namespaces/rsb/envoyfilters/redis-test2
  uid: e7e60a2d-92b1-4b07-93ac-f5677ecbd751
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        filterChain:
          filter:
            name: envoy.tcp_proxy
        portNumber: 6379
    patch:
      operation: MERGE
      value:
        name: envoy.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          idle_timeout: 24h
  workloadSelector:
    labels:
      app: app

We can see this added to the config:

kubectl -n rsb get envoyfilter redis-test -o yaml
---
---
---
                        "name": "envoy.tcp_proxy",
                        "typedConfig": {
                            "@type": "type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy",
                            "statPrefix": "outbound|6379||abc-redis-opensource.abc.svc.cluster.local",
                            "cluster": "outbound|6379||abc-redis-opensource.abc.svc.cluster.local",
                            "idleTimeout": "86400s"
                        }
---
---
---

However, we're still seeing the hourly timeouts. (If I set the above to 60 seconds, I do get 60-second timeouts, so I think this is a timeout on the front end rather than the one we're having issues with.)
I think I need to try setting the timeout on the cluster config, but I'm having difficulties doing that at the moment.

@christian-posta
Contributor

Yeah, I'm not sure why the annotation doesn't work (I'll check), but just add it as an env variable like I previously mentioned. Add it to the client-side proxy:

ISTIO_META_IDLE_TIMEOUT=24h0m0s

@nabbott2008
Author

kubectl -n abc exec -it app-f6f79bcf6-ldjgb -c istio-proxy -- env | grep ISTIO_META_IDLE_TIMEOUT
ISTIO_META_IDLE_TIMEOUT=24h0m0s

It is added correctly using the annotation but, unfortunately, we don't see any change to the timeout.

@christian-posta
Contributor

Ah okay let me check the cluster config part. There may not be a way to do that at the moment. Will report back.

@nabbott2008
Author

Thanks. N.B. We are using Istio 1.6.2 now.

@christian-posta
Contributor

christian-posta commented Jul 2, 2020

I gotta be honest, I'm not able to reproduce this. I'm getting the idle_timeout enforced by the tcp_proxy correctly. Now, if you do have filters on the cluster as well, we should take a look at those. Can you give me the JSON output of the cluster outbound|26379||dds-redis-opensource-announce-0.abc.svc.cluster.local for the proxy on the client side of the connection?

Worst case, I can jump on a Zoom call with you and we can take a look together.

@howardjohn added the kind/need more info label Jul 29, 2020
@courcelm

courcelm commented Sep 1, 2020

I'm having a similar issue on Istio 1.5.8. TCP connections are dropped for various service types (SSL, Postgres, Redis) after 1 hour. Like @nabbott2008, setting ISTIO_META_IDLE_TIMEOUT to 24h0m0s did not work, but 30s did work.

@jmp0xf

jmp0xf commented Sep 6, 2020

+1, we are seeing exactly the same issue as @courcelm described, on Istio 1.5.6.

@istio-policy-bot added the lifecycle/stale label Oct 1, 2020
@GregHanson
Member

I know idle_timeout appears in the HTTP connection manager common_http_protocol_options, the route, and the cluster.

Using an EnvoyFilter to set the cluster-level idle_timeout to 10m, users were able to trigger closed-connection failures in their app, but were unable to increase the duration beyond the 1-hour default:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: "cluster-idletimeout"
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: ANY
    patch:
      operation: MERGE
      value:
        common_http_protocol_options:
          idle_timeout: 10m      

Envoy defaults all idle_timeouts to 1 hour. @howardjohn @christian-posta, is the problem that users need to change the value in all places if they want it to go longer? I.e., does a smaller default idle_timeout at the listener level break a larger user-specified idle_timeout configured at the cluster level?

@istio-policy-bot removed the lifecycle/stale label Oct 6, 2020
@ryant1986

ryant1986 commented Oct 21, 2020

We ran into the same problem on 1.7, but we noticed that the ISTIO_META_IDLE_TIMEOUT setting was only getting picked up on the OUTBOUND side of things, not the INBOUND. By adding an additional filter applied to the INBOUND side of the request, we were able to successfully increase the timeout (we used 24 hours):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: listener-timeout-tcp
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          idle_timeout: 24h

We also created a similar filter to apply to the passthrough cluster (so that timeouts still apply to external traffic that we don't have service entries for), since the config wasn't being picked up there either.
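
For illustration, a rough sketch of what such a passthrough filter might look like (this is not the commenter's exact config; it assumes the passthrough tcp_proxy sits on the sidecar's virtualOutbound listener, and the name/namespace are placeholders):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-timeout-tcp   # placeholder name
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        # the catch-all listener that fronts the PassthroughCluster
        name: virtualOutbound
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          idle_timeout: 24h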

@jsabalos

I'm having this 1-hour timeout issue with a Postgres client as well, using Istio 1.6.7.

@Rayzhangtian

The ISTIO_META_IDLE_TIMEOUT env variable is only applied on the outbound side, so it works for the ingress gateway but not as a Pod annotation. For the service client, we should use a DR (DestinationRule) + KA (keepalive) configuration.
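
A minimal sketch of that DestinationRule (DR) + keepalive (KA) suggestion, using the Redis service from this issue and example values (note that a later comment in this thread reports keepalives alone did not stop the tcp_proxy idle_timeout):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: redis-keepalive   # placeholder name
  namespace: abc
spec:
  host: abc-redis-opensource-announce-0.abc.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        # enable TCP keepalive probes on connections to this host
        tcpKeepalive:
          time: 600s      # idle time before the first probe
          interval: 60s   # interval between probes
          probes: 4       # unanswered probes before the connection is dropped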

@istio-policy-bot added the lifecycle/stale label Jan 5, 2021
@istio-policy-bot added the lifecycle/automatically-closed label Jan 20, 2021
@pslobo

pslobo commented May 17, 2021

@Rayzhangtian We're having issues with this and connections to postgres (similar to @jsabalos). Could you please share what DR + KA configs you used that solved the issue?

@zufardhiyaulhaq
Member

I think increasing or disabling idle_timeout is only a workaround. The question is why the Envoy sidecar doesn't terminate the connection to the application (downstream) and only terminates the connection to the upstream server.

@istio-policy-bot removed the lifecycle/stale label Nov 2, 2021
@zufardhiyaulhaq
Member

zufardhiyaulhaq commented Nov 2, 2021

Here is my topology:

user -> gateway -> service (sidecar istio-proxy) -> redis

I am reducing the TCP proxy idle timeout to simulate the problem. It doesn't matter if it's 3s, 1h, 24h, or 7d: if the client is idle for longer than the threshold we set, it causes a broken pipe:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: idle-timeout-1.9-ef
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: ANY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
      proxy:
        proxyVersion: ^1\.9.*
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          idle_timeout: 3s

The first request succeeds; it creates a TCP proxy session:

2021-11-02T05:14:13.260048Z	debug	envoy pool	[C47163] using existing connection
2021-11-02T05:14:13.260053Z	debug	envoy pool	[C47163] creating stream
2021-11-02T05:14:13.260066Z	debug	envoy router	[C47317][S9607723458839045519] pool ready
2021-11-02T05:14:13.260487Z	debug	envoy filter	original_dst: New connection accepted
2021-11-02T05:14:13.260556Z	debug	envoy filter	[C47318] new tcp proxy session
2021-11-02T05:14:13.260577Z	debug	envoy filter	[C47318] Creating connection to cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.260608Z	debug	envoy upstream	transport socket match, socket default selected for host with address 10.121.211.209:6379
2021-11-02T05:14:13.260621Z	debug	envoy upstream	Created host 10.121.211.209:6379.
2021-11-02T05:14:13.260661Z	debug	envoy upstream	addHost() adding 10.121.211.209:6379
2021-11-02T05:14:13.260667Z	debug	envoy pool	creating a new connection
2021-11-02T05:14:13.260759Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:13.260772Z	debug	envoy pool	[C47319] connecting
2021-11-02T05:14:13.260777Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.260762Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:13.260778Z	debug	envoy connection	[C47319] connecting to 10.121.211.209:6379
2021-11-02T05:14:13.260800Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.260857Z	debug	envoy connection	[C47319] connection in progress
2021-11-02T05:14:13.260871Z	debug	envoy pool	queueing request due to no available connections
2021-11-02T05:14:13.260876Z	debug	envoy conn_handler	[C47318] new connection
2021-11-02T05:14:13.260896Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:13.260907Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:13.262126Z	debug	envoy connection	[C47319] connected

After the 3-second idle timeout, it seems to close the connection, but only to the upstream Redis:

2021-11-02T05:14:16.262443Z	debug	envoy filter	[C47318] Session timed out
2021-11-02T05:14:16.262483Z	debug	envoy connection	[C47318] closing data_to_write=0 type=1
2021-11-02T05:14:16.262486Z	debug	envoy connection	[C47318] closing socket: 1
2021-11-02T05:14:16.262648Z	debug	envoy connection	[C47319] closing data_to_write=0 type=1
2021-11-02T05:14:16.262660Z	debug	envoy connection	[C47319] closing socket: 1
2021-11-02T05:14:16.262705Z	debug	envoy pool	[C47319] client disconnected
2021-11-02T05:14:16.262717Z	debug	envoy conn_handler	[C47318] adding to cleanup list
2021-11-02T05:14:16.262737Z	debug	envoy pool	[C47319] connection destroyed

The second request then gets a 502:

2021-11-02T05:14:20.771115Z	debug	envoy pool	[C47171] using existing connection
2021-11-02T05:14:20.771120Z	debug	envoy pool	[C47171] creating stream
2021-11-02T05:14:20.771131Z	debug	envoy router	[C47187][S14376139327146836108] pool ready
2021-11-02T05:14:20.771559Z	debug	envoy router	[C47187][S14376139327146836108] upstream headers complete: end_stream=false
2021-11-02T05:14:20.771644Z	debug	envoy http	[C47187][S14376139327146836108] encoding headers via codec (end_stream=false):
':status', '502'
'date', 'Tue, 02 Nov 2021 05:14:20 GMT'
'content-length', '3'
'content-type', 'text/plain; charset=utf-8'
'x-envoy-upstream-service-time', '0'
'x-envoy-peer-metadata', '[REDACTED]'
'x-envoy-peer-metadata-id', 'sidecar~10.122.192.149~echo-redis-stable-5987d8dd44-kwxw8.testing~testing.svc.cluster.local'
'server', 'istio-envoy'

2021-11-02T05:14:20.771675Z	debug	envoy client	[C47171] response complete
2021-11-02T05:14:20.772082Z	debug	envoy filter	Called AuthenticationFilter : onDestroy
2021-11-02T05:14:20.772110Z	debug	envoy pool	[C47171] response complete
2021-11-02T05:14:20.772116Z	debug	envoy pool	[C47171] destroying stream: 0 remaining

And somehow, after the second request fails, the third request succeeds in creating the TCP proxy:

2021-11-02T05:14:22.059301Z	debug	envoy pool	[C47171] using existing connection
2021-11-02T05:14:22.059305Z	debug	envoy pool	[C47171] creating stream
2021-11-02T05:14:22.059314Z	debug	envoy router	[C47170][S952113864182611469] pool ready
2021-11-02T05:14:22.059671Z	debug	envoy filter	original_dst: New connection accepted
2021-11-02T05:14:22.059743Z	debug	envoy filter	[C47327] new tcp proxy session
2021-11-02T05:14:22.059799Z	debug	envoy filter	[C47327] Creating connection to cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.059828Z	debug	envoy upstream	transport socket match, socket default selected for host with address 10.121.211.209:6379
2021-11-02T05:14:22.059835Z	debug	envoy upstream	Created host 10.121.211.209:6379.
2021-11-02T05:14:22.059858Z	debug	envoy pool	creating a new connection
2021-11-02T05:14:22.059886Z	debug	envoy upstream	addHost() adding 10.121.211.209:6379
2021-11-02T05:14:22.059905Z	debug	envoy pool	[C47328] connecting
2021-11-02T05:14:22.059914Z	debug	envoy connection	[C47328] connecting to 10.121.211.209:6379
2021-11-02T05:14:22.059973Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:22.059986Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:22.060000Z	debug	envoy connection	[C47328] connection in progress
2021-11-02T05:14:22.060012Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.059988Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.060014Z	debug	envoy pool	queueing request due to no available connections
2021-11-02T05:14:22.060059Z	debug	envoy conn_handler	[C47327] new connection
2021-11-02T05:14:22.060089Z	debug	envoy upstream	membership update for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal. added 1 removed 0
2021-11-02T05:14:22.060104Z	debug	envoy upstream	re-creating local LB for TLS cluster outbound|6379||s-echo-redis-cache-redis-01.c.staging.internal.
2021-11-02T05:14:22.061287Z	debug	envoy connection	[C47328] connected
2021-11-02T05:14:22.061311Z	debug	envoy pool	[C47328] assigning connection
2021-11-02T05:14:22.061328Z	debug	envoy filter	[C47327] TCP:onUpstreamEvent(), requestedServerName: 
2021-11-02T05:14:22.062214Z	debug	envoy router	[C47170][S952113864182611469] upstream headers complete: end_stream=false
2021-11-02T05:14:22.062284Z	debug	envoy http	[C47170][S952113864182611469] encoding headers via codec (end_stream=false):
':status', '200'

You can see the full log (service sidecar) here:
echo-redis-03.log

My concern is why the Envoy sidecar only terminates the connection to the upstream (in this case Redis), but does not terminate the connection to its own downstream (the application where the sidecar is running).

This causes us a problem, since we need to configure all our applications to use a TCP idle timeout below the Envoy configuration. For example, with a 1-hour idle timeout (the default Envoy TCP proxy idle timeout), we must make sure that every application connecting to Redis, PostgreSQL, or anything else going through the TCP proxy has an idle timeout below 1 hour, so the application can initiate the TCP closing handshake itself. We receive lots of complaints from devs.

FYI:

  • This issue occurs not only with Redis, but also with PostgreSQL, or any connection that uses TCP.
  • If your client keeps sending keepalives to the upstream over TCP, this issue will not occur since the connection is not idle (our MongoDB client keeps sending traffic to the upstream MongoDB and we don't see this problem).

Istio version: v1.9.9

@howardjohn, do you know about this? Did we miss something?

@zufardhiyaulhaq
Member

zufardhiyaulhaq commented Nov 2, 2021

Here is a diagram that might help in understanding the issue:

[diagram: Untitled Diagram.drawio]

There are two expected behaviors, either of which could solve this problem:

  1. When the connection has been idle beyond the threshold, the sidecar closes the connection to the upstream (Redis, PostgreSQL, etc.) but must then open a new connection to the upstream
  2. Or, when the connection has been idle beyond the threshold, the sidecar also closes the connection to the downstream client

@zufardhiyaulhaq
Member

We tried to implement TCP keepalive, but this doesn't stop the TCP proxy idle_timeout from terminating the connection to the upstream (Redis in this case).


apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: idle-timeout-1.9-ef
  namespace: istio-system
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: ANY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
      proxy:
        proxyVersion: ^1\.9.*
    patch:
      operation: MERGE
      value:
        name: envoy.filters.network.tcp_proxy
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          idle_timeout: 20s
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: echo-redis
  namespace: testing
spec:
  hosts:
  - s-echo-redis-cache-redis-01.c.staging.internal.
  addresses:
  - 10.121.211.209
  location: MESH_EXTERNAL
  ports:
  - number: 6379
    name: tcp
    protocol: TCP
  resolution: NONE
  exportTo:
  - "."
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: echo-redis-mesh-dr
  namespace: testing
spec:
  host: s-echo-redis-cache-redis-01.c.staging.internal.
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 6379
      connectionPool:
        tcp:
          tcpKeepalive:
            probes: 9
            time: 5s
            interval: 3s

@istio-policy-bot

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-11-02. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

@batchamalick

@ryant1986, could you share the EnvoyFilter sample for the passthrough cluster here?

@devsh4

devsh4 commented Jan 11, 2024

For anyone still facing broken pipe errors due to the upstream closing the connection: based on this comment, we ended up using an EnvoyFilter, but in our case setting idle_timeout to 30s did the trick, not 24h. Further, the context was set to GATEWAY rather than SIDECAR_INBOUND.

We are on v1.19.5.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: listener-timeout-tcp
  namespace: istio-system
spec:
  configPatches:
    - applyTo: NETWORK_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.tcp_proxy
      patch:
        operation: MERGE
        value:
          name: envoy.filters.network.tcp_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
            idle_timeout: 30s
