
remove send timeout #48543

Merged: 4 commits merged into istio:master on Feb 17, 2024
Conversation

@ramaraochavali (Contributor) commented Dec 25, 2023:

The send timeout has defaulted to 0s for many releases and is stable. Setting a send timeout may cause unintended consequences, such as streams getting blocked.
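For context, a minimal sketch of the pattern being removed, with hypothetical names rather than the exact istiogrpc helper: a send timeout typically races the blocking stream send against a timer in a separate goroutine, so a client that is merely slow to read (gRPC flow control applies backpressure) gets treated as a failed stream.

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// sendWithTimeout races a blocking send against a deadline (hypothetical
// helper, illustrating the removed behavior). If the timer fires first, the
// caller sees an error and tears the stream down, even though the peer may
// simply be slow to drain its receive window.
func sendWithTimeout(ctx context.Context, timeout time.Duration, send func() error) error {
	if timeout <= 0 {
		// Timeout disabled: block until gRPC flow control accepts the message.
		// This is the behavior the PR makes unconditional.
		return send()
	}
	errCh := make(chan error, 1)
	go func() { errCh <- send() }() // the send goroutine may outlive the timeout
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return errors.New("timeout sending")
	}
}
```

With the timeout removed, the send path reduces to a plain blocking conn.stream.Send, and genuinely dead peers are handled by gRPC keepalive and maxConnectionAge instead.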


Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@istio-testing (Collaborator) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Dec 25, 2023
@ramaraochavali ramaraochavali marked this pull request as ready for review December 25, 2023 08:45
@ramaraochavali ramaraochavali requested review from a team as code owners December 25, 2023 08:45
@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. labels Dec 25, 2023
@ramaraochavali ramaraochavali added the release-notes-none Indicates a PR that does not require release notes. label Dec 25, 2023
@ramaraochavali (Contributor, Author) commented:

/test release-notes

	start := time.Now()
	defer func() { recordSendTime(time.Since(start)) }()
	return conn.stream.Send(res)
}
-	err := istiogrpc.Send(conn.stream.Context(), sendHandler)
+	err := sendResponse()
A Member commented on this diff:
I'm curious why this change solves the issue mentioned in the PR description. When PILOT_XDS_SEND_TIMEOUT == 0, it seems the behavior has not changed.

@ramaraochavali (Contributor, Author) replied:

PILOT_XDS_SEND_TIMEOUT was set to 20s when that issue happened. It was a legacy value, and we later changed the default to 0s.

The Member replied:

Got it, Thanks.

@hzxuzhonghu (Member) commented:

I am not sure this is the cause of https://github.com/istio/istio/issues/48517; can you clarify?

@ramaraochavali (Contributor, Author) commented:

> I am not sure this is the cause of #48517; can you clarify?

It is not 100% certain unless we test that specific case. I have requested a test without the timeout. But with the send timeout we have seen such blocking issues happen: #31943.

@shonecyx (Contributor) commented:

> I am not sure this is the cause of #48517; can you clarify?
>
> It is not 100% certain unless we test that specific case. I have requested a test without the timeout. But with the send timeout we have seen such blocking issues happen: #31943.

FYI, see this comment in the original issue: #48517 (comment).
We cannot consistently reproduce this issue; it is actually hard to reproduce.

@shonecyx (Contributor) commented Dec 27, 2023:

I think this would help. When CDS got stuck, the Envoy main thread spiked and the ingress gateway was responding very slowly, so xds_proxy did not get a chance to complete a successful downstream push to Envoy within the timeout; it got stuck here, resulting in the connection being closed for all pushes. @hzxuzhonghu @ramaraochavali

BTW is it possible to write some test cases for that?

@ramaraochavali (Contributor, Author) commented:

I also think this should help.

> BTW is it possible to write some test cases for that?

It is difficult to simulate this.

@hzxuzhonghu (Member) commented:

From a system stability point of view, a send timeout is necessary to prevent a thread from blocking forever. The default is no timeout; I think there is no harm in leaving the option there.

@hzxuzhonghu (Member) commented:

Another example is HTTP timeouts: we usually have a timeout setting because we want to limit the impact on the client side regardless of the other side's uncertainty.

@ramaraochavali (Contributor, Author) commented:

#31943 (comment) - see the comments from here onwards. We purposefully disabled timeouts (even Envoy does not have one; why should the xds proxy?) and left this variable as a short-term gap. Please also see the recommendation from gRPC: #31943 (comment).

@hzxuzhonghu (Member) commented:

I have not read the grpc xds implementation carefully, but there is a timeout here: https://github.com/grpc/grpc-go/blob/4f03f3ff32c983f2e9b030889041ff9d5ffb6aeb/xds/internal/xdsclient/singleton.go#L90

	// Create the new client implementation.
	c, err := newWithConfig(config, defaultWatchExpiryTimeout, defaultIdleAuthorityDeleteTimeout)
	if err != nil {
		return nil, nil, err
	}

@shonecyx (Contributor) commented:

I think the question here is whether we want to enforce PILOT_XDS_SEND_TIMEOUT = 0s for both the xds_proxy upstream and downstream. The default value already changed to 0s in 1.14, while it was 5s by default in 1.10; we had tuned it to 20s (we were not aware the default had changed to 0s when we upgraded to 1.17). If we do not see a benefit from a non-zero timeout, and we have not seen impact like thread blocking since 1.14 with the 0s default, removing it would help address this kind of XDS push timeout issue.

@hzxuzhonghu (Member) commented:

It would definitely benefit Pilot here if one of your Envoys needs 100s to process one response. We have a limit on the number of concurrent pushes.

@ramaraochavali (Contributor, Author) commented:

> It would definitely benefit Pilot here if one of your Envoys needs 100s to process one response.

What would benefit Pilot? Are you saying the send timeout benefits it?

@ramaraochavali (Contributor, Author) commented:

> I have not read the grpc xds implementation carefully, but there is a timeout here

I do not think there is a send timeout there. Based on what we have discussed here, we just need keepalives, which we already have, and do not need send timeouts.

@shonecyx (Contributor) commented:

For issue #48517, removing the timeout might help, but I doubt it is the root cause; otherwise restarting istiod would not have helped. It seems some deadlock was triggered, maybe due to the timeout.

BTW, if we agree on removing the timeout, then we should probably remove the legacy flag PILOT_XDS_SEND_TIMEOUT as well, since leaving it in place with a non-zero value conflicts with removing the timeout. As for potential thread blocking, that should be covered by keepalive, and if that is not enough we could use another mechanism, such as simply closing the connection hourly.

@shonecyx (Contributor) commented Dec 28, 2023:

Even today there are consistent timeouts from the xds proxy downstream and upstream to various proxies. Although they are not causing impact yet, they are probably due to our 20s PILOT_XDS_SEND_TIMEOUT; configuring PILOT_XDS_SEND_TIMEOUT to 0s would be equivalent to merging this PR.

We will try configuring PILOT_XDS_SEND_TIMEOUT to 0s to see if it helps.

(Screenshot attached: 2023-12-28 15:29. Example logs below.)

2023-12-28T07:29:45.143319Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:0/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T07:29:45.143266Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-702411
2023-12-28T06:25:46.343828Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:338/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T06:25:46.343792Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-699595
2023-12-28T05:22:16.244596Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:137/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T05:22:16.244556Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-696880
2023-12-28T04:52:37.310135Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 2, upstream [1861] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF	thread=31
2023-12-28T04:52:37.309415Z	warn	xdsproxy	upstream [1861] terminated with unexpected error upstream [1861] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF
2023-12-28T04:24:38.344974Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:65/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T04:24:38.344942Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-694356
2023-12-28T03:55:41.144268Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:65/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T03:55:41.144233Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-693163
2023-12-28T01:54:41.245604Z	warn	ads	SDS: Send failure for node:istio-ingressgateway-79cb4b564b-4ljnw.istio-staging resources:1140 size:10.0MB cached:16/1140: rpc error: code = DeadlineExceeded desc = timeout sending
2023-12-28T01:54:41.245569Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-4ljnw.istio-staging-687729
2023-12-28T01:54:09.991006Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 2, upstream [1855] send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF	thread=31
2023-12-28T01:54:09.990451Z	warn	xdsproxy	upstream [1855] terminated with unexpected error upstream [1855] send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF

@hzxuzhonghu (Member) commented:

Check the pilot_xds_send_time metric to see how slow it is. Removing the timeout, or increasing it, is just a workaround; can you find out why it is slow and resolve that accordingly?
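As a hedged illustration only (the metric name and wiring below are made up, not Istio's actual pilot_xds_send_time implementation), instrumenting the blocking send is enough to see how long clients keep the server waiting:

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// xdsSendSeconds is a stand-in for a pilot_xds_send_time-style histogram.
var xdsSendSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "xds_send_duration_seconds", // hypothetical metric name
	Help:    "Time spent blocked in stream.Send, i.e. waiting on gRPC flow control.",
	Buckets: prometheus.DefBuckets,
})

func init() { prometheus.MustRegister(xdsSendSeconds) }

// timedSend records how long a blocking send took before returning its error.
func timedSend(send func() error) error {
	start := time.Now()
	defer func() { xdsSendSeconds.Observe(time.Since(start).Seconds()) }()
	return send()
}
```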

@hzxuzhonghu (Member) commented:

If you mean the CPU usage is very high, would increasing the CPU limit help?

@hzxuzhonghu (Member) commented Jan 4, 2024:

Will wait to check the behavior after maxConnectionAge.
Update: Istiod will close the underlying connection after maxConnectionAge, but keepalive does not close it in this test.

@ramaraochavali (Contributor, Author) commented:

Isn't keepalive intended to check whether the connection is broken or not? In your test it seems the connection is live, but the application is blocked?

@hzxuzhonghu (Member) commented:

Yeah, right. This is common when a client is busy and not consuming the response. For a typical HTTP server we usually tune this setting. In Istio, maybe maxConnectionAge is enough, since all the clients are under our control. The worst effect is leaking the goroutine and socket for that duration.
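For reference, a rough sketch of the connection-level protections being relied on instead of a per-message send timeout; the values are illustrative, not Istio's actual settings:

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newXDSServer shows the two mechanisms discussed above. Keepalive pings
// detect a dead TCP connection, but (as noted in the test above) they do not
// fire when the peer application simply stops reading; MaxConnectionAge is
// the backstop that eventually closes even a stuck connection.
func newXDSServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:                  30 * time.Second, // ping an idle connection after 30s
			Timeout:               10 * time.Second, // close it if the ping goes unacknowledged
			MaxConnectionAge:      30 * time.Minute, // force periodic reconnects
			MaxConnectionAgeGrace: 1 * time.Minute,  // let in-flight RPCs finish first
		}),
	)
}
```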

@ramaraochavali (Contributor, Author) commented:

@hzxuzhonghu if you do not have further concerns, can you please approve this?

@ramaraochavali (Contributor, Author) commented:

@hzxuzhonghu gentle ping

@hzxuzhonghu (Member) left a review comment:

I am a little concerned that this may cause blocking silently; in that case we do not even have a way to know. If the client is blocked, istiod can still send xDS responses only up to what the socket buffer holds.
Though unrelated, we have to face the bad design we have now: the distribution state report is totally wrong once Send is called. Look at reportAllEvents.

defer func() {
	// This is a hint to help debug slow responses.
	if time.Since(tStart) > 10*time.Second {
		proxyLog.Warnf("sendDownstream took %v", time.Since(tStart))
	}
}()
A Member commented on this diff:

@ramaraochavali This kind of info is helpful and does no harm.

@ramaraochavali (Contributor, Author) replied:

OK, I can add that part back if we agree on removing the timeout.

@ramaraochavali (Contributor, Author) commented:

> I am a little concerned that this may cause blocking silently; in that case we do not even have a way to know. If the client is blocked, istiod can still send xDS responses only up to what the socket buffer holds.

With timeouts we ran into various issues, as described in #31943; I am not sure how a timeout would practically help.

@hzxuzhonghu (Member) commented:

The timeout is not the root cause of any issue, and it does not even fix the issue. You can choose not to set a timeout; that is the default behavior of Istio.

In the case where Envoy is overloaded, increasing the timeout does not help at all, and istiod can also be blocked. In my opinion, we could make the xds proxy work as an intermediate buffer to mitigate the speed gap between the xDS producer and the xDS consumer. That way, at least istiod is able to handle other xDS requests.

In the worst case, if Envoy blocks for an indeterminate time, there is no way to recover at all. We should make the control plane close the connection promptly. If Envoy is unblocked or fast enough later, reconnecting and resending the xDS request is also acceptable.
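A rough sketch of the "intermediate buffer" idea floated here (hypothetical, not part of this PR or the current xds proxy): the proxy would enqueue responses on a bounded channel and let a dedicated writer goroutine do the blocking send, so the upstream reader is never stalled by a slow Envoy.

```go
package sketch

import "errors"

var errDownstreamTooSlow = errors.New("downstream too slow: buffer full")

// bufferedForwarder decouples receiving responses from sending them
// downstream. Enqueue never blocks on the consumer; a single writer
// goroutine absorbs the speed gap between producer and consumer.
type bufferedForwarder struct {
	queue chan []byte
}

func newBufferedForwarder(size int, send func([]byte) error) *bufferedForwarder {
	f := &bufferedForwarder{queue: make(chan []byte, size)}
	go func() {
		for msg := range f.queue {
			if err := send(msg); err != nil {
				return // downstream is gone; stop draining
			}
		}
	}()
	return f
}

// Enqueue hands a response to the writer without blocking the caller.
// A full buffer is surfaced as an error so the slowness becomes visible
// instead of silently blocking the upstream stream.
func (f *bufferedForwarder) Enqueue(msg []byte) error {
	select {
	case f.queue <- msg:
		return nil
	default:
		return errDownstreamTooSlow
	}
}
```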

@howardjohn (Member) commented Jan 8, 2024:

> The timeout is not the root cause of any issue.

I am not sure how we can say this... We have had 5+ outages caused by the timeout that are extensively documented in #31943. These were also found in Istio integration tests. The root cause of those WAS the timeout, and since we removed the timeout by default 3 years ago we have never seen any issue like this again. In fact, the only similar issue we have seen during that time was when someone enabled the timeout (#48517).

Currently, we have a setting that:

(1) is off by default
(2) has proven stable when left unset for 3 years
(3) is known to cause outages when enabled

Why would we keep it around?

> In the case where Envoy is overloaded, increasing the timeout does not help at all, and istiod can also be blocked. In my opinion, we could make the xds proxy work as an intermediate buffer to mitigate the speed gap between the xDS producer and the xDS consumer. That way, at least istiod is able to handle other xDS requests.
> In the worst case, if Envoy blocks for an indeterminate time, there is no way to recover at all. We should make the control plane close the connection promptly. If Envoy is unblocked or fast enough later, reconnecting and resending the xDS request is also acceptable.

IMO there is still an outstanding issue that a malicious client can hog the semaphore around pushes. But moving this unboundedness to the xds proxy does not solve it at all, as that is also a client-side issue.

@hzxuzhonghu (Member) commented:

The influence is smallest if we move it to the sidecar side.

@howardjohn (Member) commented:

> The influence is smallest if we move it to the sidecar side.

If we are worried about clients deadlocking istiod, then it's not clear to me why fixing the agent solves the problem. We have at least 4 officially supported xDS clients: Envoy, gRPC, istioctl, and ztunnel. So we would only solve 1/4 of the problem there?

@hzxuzhonghu (Member) commented:

Making the server behave deterministically is always good practice; it can mitigate a malicious client's attack. If you look at the kube-apiserver's implementation, it sets a related timeout, IdleTimeout:

	secureServer := &http.Server{
		Addr:           s.Listener.Addr().String(),
		Handler:        handler,
		MaxHeaderBytes: 1 << 20,
		TLSConfig:      tlsConfig,

		IdleTimeout:       90 * time.Second, // matches http.DefaultTransport keep-alive timeout
		ReadHeaderTimeout: 32 * time.Second, // just shy of requestTimeoutUpperBound
	}

And take a look at this blog post: https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/. It recommends always setting timeouts, and for gRPC it is similar:

> It's critical for an HTTP server exposed to the Internet to enforce timeouts on client connections. Otherwise very slow or disappearing clients might leak file descriptors and eventually result in something along the lines of:

@ramaraochavali (Contributor, Author) commented:

> It's critical for an HTTP server exposed to the Internet to enforce timeouts on client connections.

We do have that configured via MaxConnAge. What we are debating here is a per-message send timeout (not a connection timeout), which was prone to creating issues, as documented in #31943. Our experience with setting the timeout is also the same as what @howardjohn described.

@hzxuzhonghu (Member) commented:

MaxConnAge is for load balancing, and it is far longer than we want. We want to depend on something like keepalive; however, it did not close the connection during my test.

@howardjohn (Member) commented:

IdleTimeout and ReadHeaderTimeout are not related to this; this is about writing. The only place k8s sets WriteTimeout is with a 4-hour timeout, which is far beyond our MaxConnAge:

pkg/kubelet/server/server.go
169:            WriteTimeout:   4 * 60 * time.Minute,
203:            WriteTimeout:   4 * 60 * time.Minute,

@hzxuzhonghu (Member) commented:

If you read https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/, WriteTimeout is not the same as the send timeout we have now. In k8s there are commands like kubectl logs and kubectl exec that can last a very long time.

@shonecyx (Contributor) commented:

IMHO, if we agree that the xds_proxy PILOT_XDS_SEND_TIMEOUT is the root cause of #48517 (which, to my understanding, is that incomplete pushes caused unexpected behavior like CDS getting stuck), then we should fix this bug by removing the flag.

Meanwhile, I think there is a valid concern about thread blocking due to slow pushes, but that can still happen even without the xds_proxy, since there is no timeout in Envoy's send/recv. For that issue we can add metrics/logs, and in the worst case just kill the blocked threads/connections, but it is less likely to happen, given the timeout has been removed by default for 3 years.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Jan 10, 2024
@shonecyx (Contributor) commented:

Could we reach agreement on the root cause of the clusters getting stuck in warming, which was caused by the timeout?

@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Jan 18, 2024
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali (Contributor, Author) commented:

> Could we reach agreement on the root cause of #48517, which was caused by the timeout?

@howardjohn and I are on the same page that the timeout is a bad thing and can cause issues, based on our experience, but @hzxuzhonghu has other thoughts. If you can confirm that removing the timeout helps your scenario, it would really help.

Not sure how we can move forward with this. I still think that, irrespective of whether it helps #48517 or not, a send timeout is not the right thing to do (I just cannot think of what value users should set it to, or what happens to the entire xDS flow if we run into timeouts).

@ramaraochavali (Contributor, Author) commented:

@hzxuzhonghu While some of the points you raised about bad clients causing havoc are valid, I do not see how a send timeout helps, unless I am missing something. Can we please merge this and think about future improvements?

@hzxuzhonghu (Member) approved:

Approving, as we can rely on maxConnectionAge to close the connection even if sending is stuck.

@istio-testing istio-testing merged commit 36a2515 into istio:master Feb 17, 2024
28 checks passed
@ramaraochavali ramaraochavali deleted the fix/remove_xds_send branch February 18, 2024 06:11
thedebugger pushed a commit to thedebugger/istio that referenced this pull request Mar 5, 2024
* remove send timeout

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* fix compile

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

---------

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>