-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proxy: Attempt to reconnect on connection errors #1440
Conversation
This modifies the proxy dial command to retry to retry if it hits an error. Previously it would pass a timeout, but not retry if there is a connection error of some sort. It also adds a similar retry for the listening socket.
cbfb460
to
4a53711
Compare
remaining := endTime.Sub(time.Now()) | ||
outConn, err := net.DialTimeout(network, address, remaining) | ||
if err != nil { | ||
if endTime.After(time.Now()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you reorg this as
if now.After(end) {
return err
}
retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
Can you explain what the motivation for this is? Did you find a place where retrying a dial made a difference in case of a non-timeout error? I'd almost rather immeidately fall back on the next endpoint it a dial fails... |
So both of these issues might only apply to my usage of proxier, but I will explain the rationale. If the proxy is dynamically constructed, then it is possible that the receiving service has not finished being set up yet, so retrying allows it to hold the connection open while it tries. Failing over to the next item is fine if there is another one in the list. Perhaps I could move the retry so that it cycles through the list of backends trying to find one it can connect to until the timeout is reached? The listen side is because the listen can fail if the socket has not yet been released by a previous listener, or if networking has not finished initializing yet (this happens in the case of new network namespaces). The listen can be retried by the caller so it isn't necessarily needed, although currently the failure will be eaten by OnUpdate. The dial is initiated by someone connecting to the proxy so doing what we can to find a connection before sending an RST to the caller seems preferable. That said, I'm using proxy in a non-standard way so If either of these things don't seem appropriate to the usage in Kubernetes feel free to say so. |
I can get behind making dial more robust, for sure. I'm less convinced about listen. Can we maybe break this into two parts? Doing dial in a more robust way would be great, something like retry up to N seconds on a single backend before giving up and trying the next backend, with a total timeout of M seconds. E.g. try backend 1, fail, sleep 100ms, try again - up to 1 second total, then try the next backend, same pattern, if a total timeout of 5 seconds elapses, give up. The numbers can be experimented with - good logging will help. |
@vishvananda any thoughts on this? |
Closing this for now. @vishvananda if you come back to this, please re-open. |
…-admission-defaults OCPBUGS-4658: Apply shared defaulters to CRD-based routes.
This modifies the proxy dial command to retry to retry if it hits an
error. Previously it would pass a timeout, but not retry if there
is a connection error of some sort. It also adds a similar retry
for the listening socket.