Description
Describe the feature request
Almost all of the 503s I see are caused by connection resets.
{"level":"debug","time":"2024-06-25T09:10:35.552211Z","scope":"envoy router","msg":"[Tags: \"ConnectionId\":\"621\",\"StreamId\":\"15076901503968233471\"] upstream reset: reset reason: connection termination, transport failure reason: ","caller":"external/envoy/source/common/router/router.cc:1332","thread":55}
These are connection resets between the destination sidecar and the local app. I've written about them before here: connection resets.
Many Istio users run into this. It's a pain to debug, a pain to resolve, and it keeps coming back. I believe that if the destination sidecar implemented a retry policy on resets to the local app, it would be a huge out-of-the-box improvement for Istio users.
The challenge at the moment is that the existing retryOn setting (configured under retries on a VirtualService HTTP route) describes the connection between the source sidecar and the destination. The resets described in the above scenario do not present at the source sidecar as a reset; they just present as a 503. Arguably you could retryOn 503s, but that has a much bigger blast radius than simply retrying on connection reset.
As an added thought, retryOn: reset carries some risk: in theory you could retry requests that did make it to the destination service, which is behaviour you may not want.
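To make the trade-off above concrete, here is a minimal sketch of what a source-side retry policy looks like today, assuming it is set through the `retries` block of a VirtualService HTTP route. The service name `my-app`, the namespace, and the attempt and timeout values are hypothetical placeholders, not taken from a real deployment.

```yaml
# Sketch only: the host names and numeric values are illustrative assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app            # hypothetical name
  namespace: default      # hypothetical namespace
spec:
  hosts:
  - my-app.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-app.default.svc.cluster.local
    retries:
      attempts: 2
      perTryTimeout: 2s
      # "reset" only matches resets seen by the source sidecar. A reset between
      # the destination sidecar and the local app arrives here as a 503, so it
      # would not be retried by this condition.
      retryOn: reset
      # Broader alternative with a larger blast radius (retries every 503):
      # retryOn: "503"
```

Neither option covers the hop from the destination sidecar to the local app, which is the gap this request is about.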
envoyproxy/envoy#10007 (unfortunately old and closed) detailed a reset-before-request proposal, i.e. a safe reset. I believe that implementing this feature, and making it the default on both source sidecar -> destination sidecar (e.g. retryOn in VirtualService) and destination sidecar -> destination app (currently not configurable), would resolve most users' 503s.
Describe alternatives you've considered
There are no alternatives. Currently, users have to spend an extremely large amount of time trying to debug the reason for the 503 at the destination.
Affected product area (please put an X in all that apply)
[ ] Ambient
[ ] Docs
[ ] Dual Stack
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[x] User Experience
[ ] Developer Infrastructure
Affected features (please put an X in all that apply)
[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane
Additional context