
proxy: rebind services on connect errors #952

Merged. 1 commit merged into master from proxy-buffer-closed on May 17, 2018.

Conversation

seanmonstar (Contributor):

Instead of having connect errors destroy all buffered requests,
this changes Bind to return a service that can rebind itself when
there is a connect error.

It won't try to establish the new connection itself, but waits for
the buffer to poll again. Combining this with changes in tower-buffer
to remove canceled requests from the buffer should mean that we
won't loop on connect errors forever.

Closes #899
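
For readers skimming the diff, the shape of the change is roughly the following. This is a minimal sketch in the futures 0.1 style the proxy used at the time; `MiniService`, `Binder`, `RebindingService`, and the field names are illustrative stand-ins, not the PR's actual `Service`/`Bind` types:

    extern crate futures;

    use futures::{task, Async, Poll};

    /// Stand-in for the connect-layer error the real code distinguishes.
    enum ReconnectError {
        Connect(String),
    }

    /// Stand-in for the era's tower-style `Service` readiness contract.
    trait MiniService {
        fn poll_ready(&mut self) -> Poll<(), ReconnectError>;
    }

    /// Stand-in for `Bind`: anything that can produce a fresh inner service.
    trait Binder {
        type Svc: MiniService;
        fn bind(&self) -> Self::Svc;
    }

    /// On a connect error, swap in a freshly bound service instead of
    /// failing, so buffered requests aren't all torn down with the old one.
    struct RebindingService<B: Binder> {
        binder: B,
        inner: B::Svc,
    }

    impl<B: Binder> RebindingService<B> {
        fn poll_ready(&mut self) -> Poll<(), ReconnectError> {
            match self.inner.poll_ready() {
                Err(ReconnectError::Connect(err)) => {
                    eprintln!("connect error: {}", err);
                    // Rebind, but don't poll the new service here: yield
                    // instead, so the buffer that owns this service decides
                    // whether readiness is still needed before reconnecting.
                    self.inner = self.binder.bind();
                    task::current().notify();
                    Ok(Async::NotReady)
                }
                other => other,
            }
        }
    }

The key property is that the rebind is lazy: the fresh service only starts connecting when the owner polls again.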

seanmonstar requested a review from olix0r on May 14, 2018 21:06
    // try to connect again.
    match ready {
        Err(ReconnectError::Connect(err)) => {
            error!("connect error to {:?}: {}", self.endpoint, err);
seanmonstar (Contributor, Author):

While I'd rather log more loudly when connect errors occur, this seems to trigger several times during the test, because of how many times it retries before the metrics scrape notices the TCP connection event.

This could be reduced from an error! to warn!, or lower... and/or we could also apply some backoff here so as to not loop back immediately, but after a second or something. (Currently, this yields to the executor, so any other work should be polled first, and then this will get polled again...)
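
For concreteness, here is one way the backoff mentioned above could look, sketched with tokio-timer's `Delay`. This is an assumption about a possible follow-up, not something this PR adds:

    extern crate futures;
    extern crate tokio_timer;

    use std::time::{Duration, Instant};
    use futures::{Async, Future, Poll};
    use tokio_timer::Delay;

    /// Sketch: after a connect error, arm a one-second delay and report
    /// NotReady until it fires, so the endpoint isn't retried immediately.
    struct Backoff {
        delay: Option<Delay>,
    }

    impl Backoff {
        fn poll_ready_after_error(&mut self) -> Poll<(), tokio_timer::Error> {
            let delay = self
                .delay
                .get_or_insert_with(|| Delay::new(Instant::now() + Duration::from_secs(1)));
            match delay.poll()? {
                Async::Ready(()) => {
                    // The backoff elapsed; clear it so the caller can
                    // attempt the rebind on this poll.
                    self.delay = None;
                    Ok(Async::Ready(()))
                }
                Async::NotReady => Ok(Async::NotReady),
            }
        }
    }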

olix0r (Member):

I agree that this could be quite verbose/noisy in some cases. An easy-ish solution would be to log only once per run of consecutive errors, something like:

        match ready {
            Ok(..) => {
                self.logged_err = false;
            }
            Err(ReconnectError::Connect(err)) => {
                if !self.logged_err {
                    warn!(...);
                    self.logged_err = true;
                }
            }
        }

@@ -217,16 +229,20 @@ where
         Reconnect::new(proxy)
     }

-    pub fn new_binding(&self, ep: &Endpoint, protocol: &Protocol) -> Binding<B> {
+    pub fn new_bound_service(&self, ep: &Endpoint, protocol: &Protocol) -> BoundService<B> {
         if protocol.can_reuse_clients() {
olix0r (Member):

TIOLI (take it or leave it): this method was called bind_service before being renamed to new_bound_service; perhaps just change it back?

/// - If there is an error in the inner service (such as a connect error), we
///   need to throw it away and bind a new service.
pub struct BoundService<B: tower_h2::Body + 'static> {
    bind: Bind<Arc<ctx::Proxy>, B>,
olix0r (Member):

seems to me that, since Bind::new_bound_service takes &self, this could be

pub struct BoundService<'a, B: tower_h2::Body + 'static> {
    bind: &'a Bind<Arc<ctx::Proxy>, B>,
    // ...
}

and then we wouldn't have to clone in new_bound_service?

seanmonstar (Contributor, Author):

We can't; we need to return a `'static` Service, since it will live separately.
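
A toy reproduction of why the borrow can't work; `spawn_service` is a hypothetical stand-in for whatever takes ownership of the returned service (e.g. the buffer):

    struct Bind;

    struct BorrowedService<'a> {
        bind: &'a Bind,
    }

    /// Hypothetical stand-in for the buffer/executor that takes ownership
    /// of the service and may outlive the caller's stack frame.
    fn spawn_service<S: 'static>(_svc: S) {}

    fn main() {
        let bind = Bind;
        let svc = BorrowedService { bind: &bind };
        let _ = svc.bind;
        // Uncommenting the next line fails to compile with
        // error[E0597]: `bind` does not live long enough.
        // spawn_service(svc);
    }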

olix0r (Member):

ah, yeah, you're right, nevermind!

@@ -90,7 +102,7 @@ pub struct NormalizeUri<S> {
     inner: S
 }

-pub type Service<B> = Binding<B>;
+pub type Service<B> = BoundService<B>;
olix0r (Member):

Is this type alias still necessary? It was originally added to shorten a really long return type for what's now called "Stack<B>".

Consider either just renaming BoundService<B> to Service<B> and removing the type alias, or changing all consumers of this API to refer to BoundService<B> (which seems clearer IMHO) and removing the type alias.

seanmonstar (Contributor, Author):

Actually, on trying this out, I found that other places using types from this module preferred the prefix: bind::Protocol, bind::HttpRequest, etc. So seeing bind::Service actually felt clear.
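
A toy illustration of that naming argument, with stand-in types; at call sites the module prefix does the qualifying, so the short alias reads naturally next to the other `bind::` types:

    mod bind {
        pub struct Protocol;
        pub struct HttpRequest;
        pub struct BoundService<B>(pub B);
        pub type Service<B> = BoundService<B>;
    }

    // `bind::Service<B>` reads as clearly as `bind::BoundService<B>` would.
    fn route<B>(_svc: bind::Service<B>, _proto: &bind::Protocol, _req: bind::HttpRequest) {}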

///
/// # TODO
///
/// Buffering is currently unbounded and does not apply timeouts. This must be
olix0r (Member):

Thanks for removing this out of date comment! 👍


// whoever owns this service will call `poll_ready` if they
// are still interested.
task::current().notify();
Ok(Async::NotReady)
olix0r (Member):

What are the merits of doing this versus looping with the new state?

seanmonstar (Contributor, Author):

I tried to make that clear in the comment, so it seems I should add to it.

If we loop in here, then we will eagerly set up a new connection, even if the buffer wrapping this service determines that its queue of requests has run out. This instead allows the buffer to determine whether it should loop or not.
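
Roughly, the option taken here versus the rejected one; `rebind` is a stand-in for the actual rebinding step:

    extern crate futures;

    use futures::{task, Async, Poll};

    fn rebind() { /* stand-in: bind a fresh inner service */ }

    /// What this PR does: rebind, then yield. The owning buffer calls
    /// `poll_ready` again only if it still has queued requests, so no
    /// connection is opened eagerly. Looping here instead would poll the
    /// fresh service immediately and start connecting even if the buffer's
    /// queue had already drained.
    fn poll_ready_after_connect_error() -> Poll<(), ()> {
        rebind();
        task::current().notify();
        Ok(Async::NotReady)
    }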

olix0r (Member):

thanks.

olix0r (Member):

I think that first sentence in the comment just needs to be rearranged a bit

// This service isn't ready yet. Instead of trying to make it ready,
// schedule the task for notification so that the caller can
// determine whether readiness is still necessary (i.e. whether
// there are still requests to be sent).

... or something like that...

olix0r (Member) left a comment:

looks good to me!

Instead of having connect errors destroy all buffered requests,
this changes Bind to return a service that can rebind itself when
there is a connect error.

It won't try to establish the new connection itself, but waits for
the buffer to poll again. Combining this with changes in tower-buffer
to remove canceled requests from the buffer should mean that we
won't loop on connect errors forever.

Signed-off-by: Sean McArthur <sean@seanmonstar.com>
seanmonstar merged commit fb904f0 into master on May 17, 2018
seanmonstar deleted the proxy-buffer-closed branch on May 17, 2018 21:18
khappucino pushed a commit to Nordstrom/linkerd2 that referenced this pull request Mar 5, 2019
hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Aug 24, 2020
The proxy's integration tests depend on the `net2` crate, which has been
deprecated and replaced by `socket2`. Since `net2` is no longer actively
maintained, `cargo audit` will warn us about it, so we should replace it
with `socket2`.

While I was making this change, I was curious why we were manually
constructing and binding these sockets at all, rather than just using
`tokio::net::TcpListener::bind`. After some archaeology, I determined
that this was added in linkerd/linkerd2#952, which added a test that
requires a delay between when a socket is _bound_ and when it starts
_listening_. `tokio::net::TcpListener::bind` (as well as the `std::net`
version) perform these operations together. Since this wasn't obvious
from the test code, I went ahead and moved the new `socket2` version of
this into a pair of functions, with comments explaining why we didn't
just use `tokio::net`.

Fixes linkerd/linkerd2#4891
hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Aug 24, 2020
panthervis added a commit to panthervis/linkerd2-proxy that referenced this pull request Oct 8, 2021