remove send timeout #48543
Conversation
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Skipping CI for Draft Pull Request.
/test release-notes
	start := time.Now()
	defer func() { recordSendTime(time.Since(start)) }()
	return conn.stream.Send(res)
}
-	err := istiogrpc.Send(conn.stream.Context(), sendHandler)
+	err := sendResponse()
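For context, here is a minimal sketch of what a context-and-timeout send wrapper like istiogrpc.Send plausibly does; the body below is an assumption for illustration, not the actual Istio implementation. It also shows why PILOT_XDS_SEND_TIMEOUT=0 leaves behavior unchanged: a zero timeout falls through to a plain blocking Send.

import (
    "context"
    "fmt"
    "time"
)

// sendWithTimeout is a hypothetical stand-in for istiogrpc.Send.
func sendWithTimeout(ctx context.Context, timeout time.Duration, send func() error) error {
    if timeout <= 0 {
        // A timeout of 0 means no timeout: just block on the send.
        return send()
    }
    errc := make(chan error, 1)
    go func() { errc <- send() }()
    t := time.NewTimer(timeout)
    defer t.Stop()
    select {
    case err := <-errc:
        return err
    case <-ctx.Done():
        return ctx.Err()
    case <-t.C:
        // The send goroutine may still be blocked on the stream; returning
        // early leaks it and leaves the stream in an uncertain state, which
        // is the class of problem debated below.
        return fmt.Errorf("timeout sending after %v", timeout)
    }
}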
I'm curious why this change solves the issue mentioned in the PR description. When PILOT_XDS_SEND_TIMEOUT==0, it seems the behavior has not changed.
PILOT_XDS_SEND_TIMEOUT was set to 20s when that issue happened. It was a legacy value, and we later changed it to 0s.
Got it, thanks.
I am not sure this is the cause of https://github.com/istio/istio/issues/48517; can you clarify?
FYI, see this comment in the original issue: #48517 (comment)
I think this would help. When the CDS got stuck, the Envoy main thread spiked and the ingress gateway was responding very slowly, so the xds_proxy didn't get a chance to complete a downstream push to Envoy within the timeout. It got stuck here, which resulted in the connection closing for all pushes. @hzxuzhonghu @ramaraochavali BTW, is it possible to write some test cases for that?
I also think this should help
It is difficult to simulate this.
From a system-stability point of view, a send timeout is necessary to prevent a thread from blocking forever. The default is no timeout; I think there is no harm in leaving it there.
Another example is HTTP timeouts: we mostly have a timeout setting because we want to reduce the impact on the client side regardless of the other side's uncertainty.
#31943 (comment) - see the comments starting here. We purposefully disabled timeouts (even Envoy does not have one; why should the xds proxy?) and left this var as a short-term gap. Please see the recommendation from gRPC as well: #31943 (comment)
I have not read the grpc xds implementation carefully, but there is a timeout here: https://github.com/grpc/grpc-go/blob/4f03f3ff32c983f2e9b030889041ff9d5ffb6aeb/xds/internal/xdsclient/singleton.go#L90
I think the problem here is whether we want to enforce a send timeout at all.
It would definitely benefit pilot here if one of your Envoys needs 100s to process one response. We have a limit on the maximum number of concurrent pushes.
What would benefit pilot? Are you saying the send timeout benefits it?
I do not think there is a timeout there. Based on what we have discussed here, we just need keepalives, which we already have, and we do not need send timeouts.
For this issue #48517, removing the timeout might help, but I doubt it's the root cause; otherwise restarting istiod wouldn't help. It seems some deadlock was triggered, maybe due to the timeout. BTW, if we agree on removing the timeout, then we'd probably also remove the legacy flag (PILOT_XDS_SEND_TIMEOUT).
Even today there are consistent timeouts in the xds proxy, both downstream and upstream, to various proxies. Although it's not causing impact yet, that's probably due to the 20s value. We'll try reconfiguring it and check.
Check the pilot_xds_send_time metric to see how slow it is. Removing the timeout or increasing it is just a workaround; find out why it is slow and resolve that accordingly.
If you mean the CPU usage is very high, would increasing the CPU limit help?
Will wait to check after maxConnectionAge takes effect.
Isn't keepalive intended to check whether the connection is broken? In your test it seems the connection is alive but the application level is blocking?
Yeah, right. This is common when a client is busy and not consuming the response. For a typical HTTP server we usually tune this setting. In Istio, maxConnectionAge may be enough, since all the clients are under our control. The worst effect is a goroutine and socket leak for that duration.
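For reference, this is roughly how maxConnectionAge-style bounds are expressed with gRPC server keepalive options; the values below are illustrative only, not Istio's defaults.

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
    return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionAge:      30 * time.Minute, // recycle connections periodically
        MaxConnectionAgeGrace: 10 * time.Minute, // then hard-close lingering streams
        Time:                  30 * time.Second, // ping clients that have been idle
        Timeout:               10 * time.Second, // drop the connection if a ping goes unacknowledged
    }))
}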
@hzxuzhonghu if you do not have further concerns, can you please approve this?
@hzxuzhonghu gentle ping
I am a little bit concerned that this may cause silent blocking; in that case we do not even have a way to know. If the client is blocked, istiod can still send xds responses up to the socket buffer size.
Though unrelated, we have to face the bad design we have now: the distribution state report is totally wrong after Send is called. Look at reportAllEvents.
defer func() {
	// This is a hint to help debug slow responses.
	if time.Since(tStart) > 10*time.Second {
		proxyLog.Warnf("sendDownstream took %v", time.Since(tStart))
	}
}()
@ramaraochavali This kind of info is helpful and does no harm.
OK. I can add that part back if we agree on removing the timeout.
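If it comes back, a minimal sketch of the idea, with no timeout involved, is a post-hoc warning around the blocking send; the helper name here is hypothetical:

import (
    "log"
    "time"
)

// warnIfSlow wraps a blocking send and logs a hint when it was slow.
func warnIfSlow(send func() error) error {
    start := time.Now()
    err := send()
    if d := time.Since(start); d > 10*time.Second {
        log.Printf("sendDownstream took %v", d)
    }
    return err
}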
With timeouts we ran into various issues, as described in #31943. Not sure how it will practically help.
Timeout is not the root cause of any issue, and it does not even fix the issue. You can choose not to set a timeout; this is the default behavior of Istio. In the case where Envoy is overloaded, increasing the timeout does not help at all, and istiod can also be blocked. In my opinion, we can make the xds proxy work as an intermediate buffer to mitigate the speed gap between the xds producer and the xds consumer. That way, at least istiod is able to handle other xds requests. In the worst case, if Envoy is blocked indefinitely, there is no way to recover at all. We should make the control plane close the connection as soon as possible. If Envoy is unblocked or fast enough later, reconnecting and resending the xds request is also acceptable.
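A sketch of that intermediate-buffer idea: decouple the upstream recv loop from a slow downstream send with a bounded channel. This is a hypothetical pattern, not the actual xds proxy code, and the Response type stands in for the real DiscoveryResponse proto.

// Response stands in for an xDS DiscoveryResponse.
type Response struct{ Nonce string }

func forward(upstream <-chan *Response, send func(*Response) error) {
    buf := make(chan *Response, 64) // bounded buffer between producer and consumer
    go func() {
        for res := range buf {
            if err := send(res); err != nil { // may block on a slow client
                return
            }
        }
    }()
    for res := range upstream {
        select {
        case buf <- res:
        default:
            // Buffer full: the client is too slow. Drop or coalesce here; for
            // state-of-the-world xDS a newer response supersedes older ones.
        }
    }
    close(buf)
}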
I am not sure how we can say this... We have had 5+ outages caused by the timeout that are extensively documented in #31943. These were also found in Istio integration tests. The root cause of those WAS the timeouts, and since we removed the timeout by default 3 years ago we have never seen any issue like this again. In fact, the only similar issue we have seen during that time was when someone enabled the timeout (#48517). Currently, we have a setting that is (1) off by default. Why would we keep it around?
IMO there is still an outstanding issue that a malicious client can hog the semaphore around pushes. But moving this unboundedness to the xds proxy does not solve it at all, as that is also a client-side issue.
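To make the semaphore concern concrete, here is a hypothetical sketch using golang.org/x/sync/semaphore; istiod's real push throttle differs in detail.

import (
    "context"

    "golang.org/x/sync/semaphore"
)

var pushSem = semaphore.NewWeighted(100) // push concurrency limit (illustrative)

func push(ctx context.Context, send func() error) error {
    if err := pushSem.Acquire(ctx, 1); err != nil {
        return err
    }
    defer pushSem.Release(1)
    // If send blocks forever on a client that never reads, this token is
    // held forever too, starving pushes to healthy clients.
    return send()
}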
The influence is smallest if it moves to the sidecar side.
If we are worried about clients deadlocking istiod, then it's not clear to me why fixing the agent solves the problem. We have at least four officially supported xds clients: Envoy, gRPC, istioctl, and ztunnel. So we would only solve 1/4 of the problem there?
Making the server have deterministic behavior is always good practice; it can mitigate a malicious client's attacks. If you look at the kube-apiserver's implementation, it has set related timeouts.
And take a look at this blog post: https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/. It recommends always setting timeouts, and for gRPC it is similar.
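For comparison, the pattern that blog post recommends for a plain Go net/http server, with illustrative values; this is only to frame the analogy, not something Istio configures this way.

import (
    "log"
    "net/http"
    "time"
)

func main() {
    srv := &http.Server{
        Addr:              ":8080",
        ReadHeaderTimeout: 10 * time.Second,  // bound slow request headers
        ReadTimeout:       30 * time.Second,  // bound the whole request read
        WriteTimeout:      30 * time.Second,  // bound the whole response write
        IdleTimeout:       120 * time.Second, // bound keep-alive idle time
    }
    log.Fatal(srv.ListenAndServe())
}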
We do have that configured via MaxConnectionAge. What we are debating here is an individual message timeout (not a connection timeout), which was prone to creating issues, as documented in #31943. Our experience with setting timeouts is also the same as what @howardjohn described.
IdleTimeout and ReadHeaderTimeout are not correlated to this; this is about writing. The only place k8s sets WriteTimeout is with a 4-hour timeout, which is far beyond our MaxConnectionAge.
If you read https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/, WriteTimeout is not the same as the send timeout we have now. In k8s there are commands like kubectl logs and kubectl exec, and they can last very long.
IMHO, if we agree on removing the xds_proxy timeout, let's proceed with that. Meanwhile, I think there is a valid concern about thread blocking due to slow pushes, but that would still happen even without the xds_proxy, since there's no timeout in Envoy's send/recv. For that we can add metrics/logs and, in the worst case, just kill all blocking threads/connections, but this is less likely to happen since the timeout has been removed for 3 years.
Could we reach agreement that the root cause of the cluster being stuck at warming was the timeout?
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@howardjohn and I are on the same page that the timeout is a bad thing and can cause issues, based on our experience, but @hzxuzhonghu has other thoughts. If you can confirm that removing the timeout helps your scenario, it would really help. Not sure how we can move forward with this. I still think that, irrespective of whether it helps #48517 or not, a send timeout is not the right thing to do (I just can't think of how users would set a value for this, and what happens to the entire xds flow if we run into timeouts).
@hzxuzhonghu While some of the points you brought up about bad clients causing havoc are valid, I do not see how a send timeout helps, unless I am missing something. Can we please merge this and think about future improvements?
Approving, as we can rely on maxConnectionAge to close the connection even if sending is stuck.
* remove send timeout
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* fix compile
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
---------
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
The send timeout has defaulted to 0s for many releases and is stable. Setting a send timeout may cause unintended consequences, such as streams getting blocked.