
proxy: Attempt to reconnect on connection errors #1440

Closed
wants to merge 1 commit into from

Conversation

vishvananda

This modifies the proxy dial command to retry if it hits an
error. Previously it would pass a timeout, but not retry if there
was a connection error of some sort. It also adds a similar retry
for the listening socket.
remaining := endTime.Sub(time.Now())
outConn, err := net.DialTimeout(network, address, remaining)
if err != nil {
    if endTime.After(time.Now()) {
Member

Could you reorg this as

if now.After(end) {
    return err
}
retry

Author

sure

@thockin
Member

thockin commented Sep 25, 2014

Can you explain what the motivation for this is? Did you find a place where retrying a dial made a difference in case of a non-timeout error? I'd almost rather immediately fall back on the next endpoint if a dial fails...

@vishvananda
Author

So both of these issues might only apply to my usage of proxier, but I will explain the rationale. If the proxy is dynamically constructed, then it is possible that the receiving service has not finished being set up yet, so retrying allows the proxy to hold the connection open while it tries. Failing over to the next item is fine if there is another one in the list. Perhaps I could move the retry so that it cycles through the list of backends, trying to find one it can connect to until the timeout is reached?

The listen side is needed because listen can fail if the socket has not yet been released by a previous listener, or if networking has not finished initializing (this happens in the case of new network namespaces).

The listen can be retried by the caller so it isn't necessarily needed, although currently the failure will be eaten by OnUpdate. The dial is initiated by someone connecting to the proxy so doing what we can to find a connection before sending an RST to the caller seems preferable.

That said, I'm using proxy in a non-standard way, so if either of these things doesn't seem appropriate to the usage in Kubernetes, feel free to say so.

@thockin
Member

thockin commented Sep 25, 2014

I can get behind making dial more robust, for sure. I'm less convinced about listen. Can we maybe break this into two parts?

Doing dial in a more robust way would be great, something like

retry up to N seconds on a single backend before giving up and trying the next backend, with a total timeout of M seconds. E.g.

try backend 1, fail, sleep 100ms, try again - up to 1 second total, then try the next backend, same pattern, if a total timeout of 5 seconds elapses, give up.

The numbers can be experimented with - good logging will help.

@brendandburns
Contributor

@vishvananda any thoughts on this?

@thockin
Member

thockin commented Oct 20, 2014

Closing this for now. @vishvananda if you come back to this, please re-open.

@thockin thockin closed this Oct 20, 2014
tkashem pushed a commit to tkashem/kubernetes that referenced this pull request Dec 15, 2022 ("OCPBUGS-4658: Apply shared defaulters to CRD-based routes.", …-admission-defaults)

3 participants