
remove send timeout #48543

Merged: 4 commits merged into istio:master on Feb 17, 2024
Conversation

@ramaraochavali (Contributor) commented Dec 25, 2023:

The send timeout has defaulted to 0s for many releases and is stable. Setting a send timeout may cause unintended consequences, such as streams getting blocked.
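For context, a minimal sketch of the pattern being removed, with hypothetical names rather than the exact istiogrpc helper: a send timeout typically races the blocking stream send against a timer in a separate goroutine, so a client that is merely slow to read (gRPC flow control applies backpressure) gets treated as a failed stream.

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// sendWithTimeout races a blocking send against a deadline (hypothetical
// helper, illustrating the removed behavior). If the timer fires first, the
// caller sees an error and tears the stream down, even though the peer may
// simply be slow to drain its receive window.
func sendWithTimeout(ctx context.Context, timeout time.Duration, send func() error) error {
	if timeout <= 0 {
		// Timeout disabled: block until gRPC flow control accepts the message.
		// This is the behavior the PR makes unconditional.
		return send()
	}
	errCh := make(chan error, 1)
	go func() { errCh <- send() }() // the send goroutine may outlive the timeout
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return errors.New("timeout sending")
	}
}
```

With the timeout removed, the send path reduces to a plain blocking conn.stream.Send, and genuinely dead peers are handled by gRPC keepalive and maxConnectionAge instead.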


Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@istio-testing (Collaborator) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Dec 25, 2023
@ramaraochavali ramaraochavali marked this pull request as ready for review December 25, 2023 08:45
@ramaraochavali ramaraochavali requested review from a team as code owners December 25, 2023 08:45
@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. labels Dec 25, 2023
@ramaraochavali ramaraochavali added the release-notes-none Indicates a PR that does not require release notes. label Dec 25, 2023
@ramaraochavali (Contributor, Author) commented:

/test release-notes

	start := time.Now()
	defer func() { recordSendTime(time.Since(start)) }()
	return conn.stream.Send(res)
}
-	err := istiogrpc.Send(conn.stream.Context(), sendHandler)
+	err := sendResponse()
A Member commented on this diff:
I'm curious why this change solves the issue mentioned in the PR description. When PILOT_XDS_SEND_TIMEOUT == 0, it seems the behavior has not changed.

@ramaraochavali (Contributor, Author) replied:

PILOT_XDS_SEND_TIMEOUT was set to 20s when that issue happened. It was a legacy value, and we later changed the default to 0s.

The Member replied:

Got it, Thanks.

@hzxuzhonghu (Member) commented:

I am not sure this is the cause of https://github.com/istio/istio/issues/48517; can you clarify?

@ramaraochavali (Contributor, Author) commented:

> I am not sure this is the cause of #48517; can you clarify?

It is not 100% certain unless we test that specific case. I have requested a test without the timeout. But with the send timeout we have seen such blocking issues happen: #31943.

@shonecyx (Contributor) commented:

> I am not sure this is the cause of #48517; can you clarify?
>
> It is not 100% certain unless we test that specific case. I have requested a test without the timeout. But with the send timeout we have seen such blocking issues happen: #31943.

FYI, see this comment in the original issue: #48517 (comment).
We cannot consistently reproduce this issue; it is actually hard to reproduce.

@shonecyx (Contributor) commented Dec 27, 2023:

I think this would help. When CDS got stuck, the Envoy main thread spiked and the ingress gateway was responding very slowly, so xds_proxy did not get a chance to complete a successful downstream push to Envoy within the timeout; it got stuck here, resulting in the connection being closed for all pushes. @hzxuzhonghu @ramaraochavali

BTW is it possible to write some test cases for that?

@ramaraochavali (Contributor, Author) commented:

I also think this should help.

> BTW is it possible to write some test cases for that?

It is difficult to simulate this.

@hzxuzhonghu (Member) commented:

From a system stability point of view, a send timeout is necessary to prevent a thread from blocking forever. The default is no timeout; I think there is no harm in leaving the option there.

@hzxuzhonghu (Member) commented:

Another example is HTTP timeouts: we usually have a timeout setting because we want to limit the impact on the client side regardless of the other side's uncertainty.

@ramaraochavali (Contributor, Author) commented:

#31943 (comment) - see the comments from here onwards. We purposefully disabled timeouts (even Envoy does not have one; why should the xds proxy?) and left this variable as a short-term gap. Please also see the recommendation from gRPC: #31943 (comment).

@hzxuzhonghu (Member) commented:

I have not read the grpc xds implementation carefully, but there is a timeout here: https://github.com/grpc/grpc-go/blob/4f03f3ff32c983f2e9b030889041ff9d5ffb6aeb/xds/internal/xdsclient/singleton.go#L90

	// Create the new client implementation.
	c, err := newWithConfig(config, defaultWatchExpiryTimeout, defaultIdleAuthorityDeleteTimeout)
	if err != nil {
		return nil, nil, err
	}

@shonecyx (Contributor) commented:

I think the question here is whether we want to enforce PILOT_XDS_SEND_TIMEOUT = 0s for both the xds_proxy upstream and downstream. The default value already changed to 0s in 1.14, while it was 5s by default in 1.10; we had tuned it to 20s (we were not aware the default had changed to 0s when we upgraded to 1.17). If we do not see a benefit from a non-zero timeout, and we have not seen impact like thread blocking since 1.14 with the 0s default, removing it would help address this kind of XDS push timeout issue.

@hzxuzhonghu (Member) commented:

It would definitely benefit Pilot here if one of your Envoys needs 100s to process one response. We have a limit on the number of concurrent pushes.

@ramaraochavali (Contributor, Author) commented:

> It would definitely benefit Pilot here if one of your Envoys needs 100s to process one response.

What would benefit Pilot? Are you saying the send timeout benefits it?

@ramaraochavali (Contributor, Author) commented:

> I have not read the grpc xds implementation carefully, but there is a timeout here

I do not think there is a send timeout there. Based on what we have discussed here, we just need keepalives, which we already have, and do not need send timeouts.

@shonecyx (Contributor) commented:

For issue #48517, removing the timeout might help, but I doubt it is the root cause; otherwise restarting istiod would not have helped. It seems some deadlock was triggered, maybe due to the timeout.

BTW, if we agree on removing the timeout, then we should probably remove the legacy flag PILOT_XDS_SEND_TIMEOUT as well, since leaving it in place with a non-zero value conflicts with removing the timeout. As for potential thread blocking, that should be covered by keepalive, and if that is not enough we could use another mechanism, such as simply closing the connection hourly.

@shonecyx (Contributor) commented Dec 28, 2023:

Even today there are consistent timeouts from the xds proxy downstream and upstream to various proxies. Although they are not causing impact yet, they are probably due to our 20s PILOT_XDS_SEND_TIMEOUT; configuring PILOT_XDS_SEND_TIMEOUT to 0s would be equivalent to merging this PR.

We will try configuring PILOT_XDS_SEND_TIMEOUT to 0s to see if it helps.

(Screenshot attached: 2023-12-28 15:29. Example logs below.)

2023-12-28T07:29:45.143319Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:0/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T07:29:45.143266Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-702411
2023-12-28T06:25:46.343828Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:338/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T06:25:46.343792Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-699595
2023-12-28T05:22:16.244596Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:137/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T05:22:16.244556Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-696880
2023-12-28T04:52:37.310135Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 2, upstream [1861] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF	thread=31
2023-12-28T04:52:37.309415Z	warn	xdsproxy	upstream [1861] terminated with unexpected error upstream [1861] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF
2023-12-28T04:24:38.344974Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:65/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T04:24:38.344942Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-694356
2023-12-28T03:55:41.144268Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:65/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T03:55:41.144233Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-693163
2023-12-28T01:54:41.245604Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:16/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T01:54:41.245569Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-687729
2023-12-28T01:54:09.991006Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 2, upstream [1855] send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF	thread=31
2023-12-28T01:54:09.990451Z	warn	xdsproxy	upstream [1855] terminated with unexpected error upstream [1855] send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF

@hzxuzhonghu (Member) commented:

Check the pilot_xds_send_time metric to see how slow it is. Removing the timeout, or increasing it, is just a workaround; can you find out why it is slow and resolve that accordingly?
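As a hedged illustration only (the metric name and wiring below are made up, not Istio's actual pilot_xds_send_time implementation), instrumenting the blocking send is enough to see how long clients keep the server waiting:

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// xdsSendSeconds is a stand-in for a pilot_xds_send_time-style histogram.
var xdsSendSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "xds_send_duration_seconds", // hypothetical metric name
	Help:    "Time spent blocked in stream.Send, i.e. waiting on gRPC flow control.",
	Buckets: prometheus.DefBuckets,
})

func init() { prometheus.MustRegister(xdsSendSeconds) }

// timedSend records how long a blocking send took before returning its error.
func timedSend(send func() error) error {
	start := time.Now()
	defer func() { xdsSendSeconds.Observe(time.Since(start).Seconds()) }()
	return send()
}
```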

@hzxuzhonghu (Member) commented:

If you mean the CPU usage is very high, would increasing the CPU limit help?

@hzxuzhonghu (Member) commented Jan 4, 2024:

Will wait to check the behavior after maxConnectionAge.
Update: Istiod will close the underlying connection after maxConnectionAge, but keepalive does not close it in this test.

@ramaraochavali (Contributor, Author) commented:

Isn't keepalive intended to check whether the connection is broken or not? In your test it seems the connection is live, but the application is blocked?

@hzxuzhonghu (Member) commented:

Yeah, right. This is common when a client is busy and not consuming the response. For a typical HTTP server we usually tune this setting. In Istio, maybe maxConnectionAge is enough, since all the clients are under our control. The worst effect is leaking the goroutine and socket for that duration.
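For reference, a rough sketch of the connection-level protections being relied on instead of a per-message send timeout; the values are illustrative, not Istio's actual settings:

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newXDSServer shows the two mechanisms discussed above. Keepalive pings
// detect a dead TCP connection, but (as noted in the test above) they do not
// fire when the peer application simply stops reading; MaxConnectionAge is
// the backstop that eventually closes even a stuck connection.
func newXDSServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:                  30 * time.Second, // ping an idle connection after 30s
			Timeout:               10 * time.Second, // close it if the ping goes unacknowledged
			MaxConnectionAge:      30 * time.Minute, // force periodic reconnects
			MaxConnectionAgeGrace: 1 * time.Minute,  // let in-flight RPCs finish first
		}),
	)
}
```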

@ramaraochavali (Contributor, Author) commented:

@hzxuzhonghu if you do not have further concerns, can you please approve this?

@ramaraochavali (Contributor, Author) commented:

@hzxuzhonghu gentle ping

@hzxuzhonghu (Member) left a review comment:

I am a little concerned that this may cause blocking silently; in that case we do not even have a way to know. If the client is blocked, istiod can still send xDS responses only up to what the socket buffer holds.
Though unrelated, we have to face the bad design we have now: the distribution state report is totally wrong once Send is called. Look at reportAllEvents.

defer func() {
	// This is a hint to help debug slow responses.
	if time.Since(tStart) > 10*time.Second {
		proxyLog.Warnf("sendDownstream took %v", time.Since(tStart))
	}
}()
A Member commented on this diff:

@ramaraochavali This kind of info is helpful and does no harm.

@ramaraochavali (Contributor, Author) replied:

OK, I can add that part back if we agree on removing the timeout.

@ramaraochavali (Contributor, Author) commented:

> I am a little concerned that this may cause blocking silently; in that case we do not even have a way to know. If the client is blocked, istiod can still send xDS responses only up to what the socket buffer holds.

With timeouts we ran into various issues, as described in #31943; I am not sure how a timeout would practically help.

@hzxuzhonghu (Member) commented:

The timeout is not the root cause of any issue, and it does not even fix the issue. You can choose not to set a timeout; that is the default behavior of Istio.

In the case where Envoy is overloaded, increasing the timeout does not help at all, and istiod can also be blocked. In my opinion, we could make the xds proxy work as an intermediate buffer to mitigate the speed gap between the xDS producer and the xDS consumer. That way, at least istiod is able to handle other xDS requests.

In the worst case, if Envoy blocks for an indeterminate time, there is no way to recover at all. We should make the control plane close the connection promptly. If Envoy is unblocked or fast enough later, reconnecting and resending the xDS request is also acceptable.
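A rough sketch of the "intermediate buffer" idea floated here (hypothetical, not part of this PR or the current xds proxy): the proxy would enqueue responses on a bounded channel and let a dedicated writer goroutine do the blocking send, so the upstream reader is never stalled by a slow Envoy.

```go
package sketch

import "errors"

var errDownstreamTooSlow = errors.New("downstream too slow: buffer full")

// bufferedForwarder decouples receiving responses from sending them
// downstream. Enqueue never blocks on the consumer; a single writer
// goroutine absorbs the speed gap between producer and consumer.
type bufferedForwarder struct {
	queue chan []byte
}

func newBufferedForwarder(size int, send func([]byte) error) *bufferedForwarder {
	f := &bufferedForwarder{queue: make(chan []byte, size)}
	go func() {
		for msg := range f.queue {
			if err := send(msg); err != nil {
				return // downstream is gone; stop draining
			}
		}
	}()
	return f
}

// Enqueue hands a response to the writer without blocking the caller.
// A full buffer is surfaced as an error so the slowness becomes visible
// instead of silently blocking the upstream stream.
func (f *bufferedForwarder) Enqueue(msg []byte) error {
	select {
	case f.queue <- msg:
		return nil
	default:
		return errDownstreamTooSlow
	}
}
```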

@howardjohn (Member) commented Jan 8, 2024:

> The timeout is not the root cause of any issue.

I am not sure how we can say this... We have had 5+ outages caused by the timeout that are extensively documented in #31943. These were also found in Istio integration tests. The root cause of those WAS the timeout, and since we removed the timeout by default 3 years ago we have never seen any issue like this again. In fact, the only similar issue we have seen during that time was when someone enabled the timeout (#48517).

Currently, we have a setting that:

(1) is off by default
(2) has proven stable when left unset for 3 years
(3) is known to cause outages when enabled

Why would we keep it around?

> In the case where Envoy is overloaded, increasing the timeout does not help at all, and istiod can also be blocked. In my opinion, we could make the xds proxy work as an intermediate buffer to mitigate the speed gap between the xDS producer and the xDS consumer. That way, at least istiod is able to handle other xDS requests.
> In the worst case, if Envoy blocks for an indeterminate time, there is no way to recover at all. We should make the control plane close the connection promptly. If Envoy is unblocked or fast enough later, reconnecting and resending the xDS request is also acceptable.

IMO there is still an outstanding issue that a malicious client can hog the semaphore around pushes. But moving this unboundedness to the xds proxy does not solve it at all, as that is also a client-side issue.

@hzxuzhonghu (Member) commented:

The influence is smallest if we move it to the sidecar side.

@howardjohn (Member) commented:

> The influence is smallest if we move it to the sidecar side.

If we are worried about clients deadlocking istiod, then it's not clear to me why fixing the agent solves the problem. We have at least 4 officially supported xDS clients: Envoy, gRPC, istioctl, and ztunnel. So we would only solve 1/4 of the problem there?

@hzxuzhonghu (Member) commented:

Making the server behave deterministically is always good practice; it can mitigate a malicious client's attack. If you look at the kube-apiserver's implementation, it sets a related timeout, IdleTimeout:

	secureServer := &http.Server{
		Addr:           s.Listener.Addr().String(),
		Handler:        handler,
		MaxHeaderBytes: 1 << 20,
		TLSConfig:      tlsConfig,

		IdleTimeout:       90 * time.Second, // matches http.DefaultTransport keep-alive timeout
		ReadHeaderTimeout: 32 * time.Second, // just shy of requestTimeoutUpperBound
	}

And take a look at this blog post: https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/. It recommends always setting timeouts, and for gRPC it is similar:

> It's critical for an HTTP server exposed to the Internet to enforce timeouts on client connections. Otherwise very slow or disappearing clients might leak file descriptors and eventually result in something along the lines of:

@ramaraochavali (Contributor, Author) commented:

> It's critical for an HTTP server exposed to the Internet to enforce timeouts on client connections.

We do have that configured via MaxConnAge. What we are debating here is a per-message send timeout (not a connection timeout), which was prone to creating issues, as documented in #31943. Our experience with setting the timeout is also the same as what @howardjohn described.

@hzxuzhonghu (Member) commented:

MaxConnAge is for load balancing, and it is far longer than we want. We want to depend on something like keepalive; however, it did not close the connection during my test.

@howardjohn (Member) commented:

IdleTimeout and ReadHeaderTimeout are not related to this; this is about writing. The only place k8s sets WriteTimeout is with a 4-hour timeout, which is far beyond our MaxConnAge:

pkg/kubelet/server/server.go
169:            WriteTimeout:   4 * 60 * time.Minute,
203:            WriteTimeout:   4 * 60 * time.Minute,

@hzxuzhonghu (Member) commented:

If you read https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/, WriteTimeout is not the same as the send timeout we have now. In k8s there are commands like kubectl logs and kubectl exec that can last a very long time.

@shonecyx (Contributor) commented:

IMHO, if we agree that the xds_proxy PILOT_XDS_SEND_TIMEOUT is the root cause of #48517 (which, to my understanding, is that incomplete pushes caused unexpected behavior like CDS getting stuck), then we should fix this bug by removing the flag.

Meanwhile, I think there is a valid concern about thread blocking due to slow pushes, but that can still happen even without the xds_proxy, since there is no timeout in Envoy's send/recv. For that issue we can add metrics/logs, and in the worst case just kill the blocked threads/connections, but it is less likely to happen, given the timeout has been removed by default for 3 years.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Jan 10, 2024
@shonecyx (Contributor) commented:

Could we reach agreement on the root cause of the clusters getting stuck in warming, which was caused by the timeout?

@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Jan 18, 2024
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali (Contributor, Author) commented:

> Could we reach agreement on the root cause of #48517, which was caused by the timeout?

@howardjohn and I are on the same page that the timeout is a bad thing and can cause issues, based on our experience, but @hzxuzhonghu has other thoughts. If you can confirm that removing the timeout helps your scenario, it would really help.

Not sure how we can move forward with this. I still think that, irrespective of whether it helps #48517 or not, a send timeout is not the right thing to do (I just cannot think of what value users should set it to, or what happens to the entire xDS flow if we run into timeouts).

@ramaraochavali (Contributor, Author) commented:

@hzxuzhonghu While some of the points you raised about bad clients causing havoc are valid, I do not see how a send timeout helps, unless I am missing something. Can we please merge this and think about future improvements?

@hzxuzhonghu (Member) approved:

Approving, as we can rely on maxConnectionAge to close the connection even if sending is stuck.

@istio-testing istio-testing merged commit 36a2515 into istio:master Feb 17, 2024
28 checks passed
@ramaraochavali ramaraochavali deleted the fix/remove_xds_send branch February 18, 2024 06:11
thedebugger pushed a commit to thedebugger/istio that referenced this pull request Mar 5, 2024
* remove send timeout

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* fix compile

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

---------

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>