
proxy: rebind services on connect errors #952

Merged. 1 commit merged into master from proxy-buffer-closed on May 17, 2018.

Conversation

seanmonstar (Contributor):

Instead of having connect errors destroy all buffered requests,
this changes Bind to return a service that can rebind itself when
there is a connect error.

It won't try to establish the new connection itself, but waits for
the buffer to poll again. Combining this with changes in tower-buffer
to remove canceled requests from the buffer should mean that we
won't loop on connect errors forever.

Closes #899
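
For readers skimming the diff, the shape of the change is roughly the following. This is a minimal sketch in the futures 0.1 style the proxy used at the time; `MiniService`, `Binder`, `RebindingService`, and the field names are illustrative stand-ins, not the PR's actual `Service`/`Bind` types:

    extern crate futures;

    use futures::{task, Async, Poll};

    /// Stand-in for the connect-layer error the real code distinguishes.
    enum ReconnectError {
        Connect(String),
    }

    /// Stand-in for the era's tower-style `Service` readiness contract.
    trait MiniService {
        fn poll_ready(&mut self) -> Poll<(), ReconnectError>;
    }

    /// Stand-in for `Bind`: anything that can produce a fresh inner service.
    trait Binder {
        type Svc: MiniService;
        fn bind(&self) -> Self::Svc;
    }

    /// On a connect error, swap in a freshly bound service instead of
    /// failing, so buffered requests aren't all torn down with the old one.
    struct RebindingService<B: Binder> {
        binder: B,
        inner: B::Svc,
    }

    impl<B: Binder> RebindingService<B> {
        fn poll_ready(&mut self) -> Poll<(), ReconnectError> {
            match self.inner.poll_ready() {
                Err(ReconnectError::Connect(err)) => {
                    eprintln!("connect error: {}", err);
                    // Rebind, but don't poll the new service here: yield
                    // instead, so the buffer that owns this service decides
                    // whether readiness is still needed before reconnecting.
                    self.inner = self.binder.bind();
                    task::current().notify();
                    Ok(Async::NotReady)
                }
                other => other,
            }
        }
    }

The key property is that the rebind is lazy: the fresh service only starts connecting when the owner polls again.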

seanmonstar requested a review from olix0r on May 14, 2018 21:06
    // try to connect again.
    match ready {
        Err(ReconnectError::Connect(err)) => {
            error!("connect error to {:?}: {}", self.endpoint, err);
seanmonstar (Contributor, Author):

While I'd rather log more loudly when connect errors occur, this seems to trigger several times during the test, because of how many times it retries before the metrics scrape notices the TCP connection event.

This could be reduced from an error! to warn!, or lower... and/or we could also apply some backoff here so as to not loop back immediately, but after a second or something. (Currently, this yields to the executor, so any other work should be polled first, and then this will get polled again...)
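
For concreteness, here is one way the backoff mentioned above could look, sketched with tokio-timer's `Delay`. This is an assumption about a possible follow-up, not something this PR adds:

    extern crate futures;
    extern crate tokio_timer;

    use std::time::{Duration, Instant};
    use futures::{Async, Future, Poll};
    use tokio_timer::Delay;

    /// Sketch: after a connect error, arm a one-second delay and report
    /// NotReady until it fires, so the endpoint isn't retried immediately.
    struct Backoff {
        delay: Option<Delay>,
    }

    impl Backoff {
        fn poll_ready_after_error(&mut self) -> Poll<(), tokio_timer::Error> {
            let delay = self
                .delay
                .get_or_insert_with(|| Delay::new(Instant::now() + Duration::from_secs(1)));
            match delay.poll()? {
                Async::Ready(()) => {
                    // The backoff elapsed; clear it so the caller can
                    // attempt the rebind on this poll.
                    self.delay = None;
                    Ok(Async::Ready(()))
                }
                Async::NotReady => Ok(Async::NotReady),
            }
        }
    }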

olix0r (Member):

I agree that this could be quite verbose/noisy in some cases. An easy-ish solution would be to log only once per run of consecutive errors, something like:

        match ready {
            Ok(..) => {
                self.logged_err = false;
            }
            Err(ReconnectError::Connect(err)) => {
                if !self.logged_err {
                    warn!(...);
                    self.logged_err = true;
                }
            }
        }

@@ -217,16 +229,20 @@ where
         Reconnect::new(proxy)
     }

-    pub fn new_binding(&self, ep: &Endpoint, protocol: &Protocol) -> Binding<B> {
+    pub fn new_bound_service(&self, ep: &Endpoint, protocol: &Protocol) -> BoundService<B> {
         if protocol.can_reuse_clients() {
olix0r (Member):

TIOLI (take it or leave it): this method was called bind_service before being renamed to new_bound_service; perhaps just change it back?

/// - If there is an error in the inner service (such as a connect error), we
///   need to throw it away and bind a new service.
pub struct BoundService<B: tower_h2::Body + 'static> {
    bind: Bind<Arc<ctx::Proxy>, B>,
olix0r (Member):

seems to me that, since Bind::new_bound_service takes &self, this could be

pub struct BoundService<'a, B: tower_h2::Body + 'static> {
    bind: &'a Bind<Arc<ctx::Proxy>, B>,
    // ...
}

and then we wouldn't have to clone in new_bound_service?

seanmonstar (Contributor, Author):

We can't; we need to return a `'static` Service, since it will live separately.
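
A toy reproduction of why the borrow can't work; `spawn_service` is a hypothetical stand-in for whatever takes ownership of the returned service (e.g. the buffer):

    struct Bind;

    struct BorrowedService<'a> {
        bind: &'a Bind,
    }

    /// Hypothetical stand-in for the buffer/executor that takes ownership
    /// of the service and may outlive the caller's stack frame.
    fn spawn_service<S: 'static>(_svc: S) {}

    fn main() {
        let bind = Bind;
        let svc = BorrowedService { bind: &bind };
        let _ = svc.bind;
        // Uncommenting the next line fails to compile with
        // error[E0597]: `bind` does not live long enough.
        // spawn_service(svc);
    }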

olix0r (Member):

ah, yeah, you're right, nevermind!

@@ -90,7 +102,7 @@ pub struct NormalizeUri<S> {
     inner: S
 }

-pub type Service<B> = Binding<B>;
+pub type Service<B> = BoundService<B>;
olix0r (Member):

Is this type alias still necessary? It was originally added to shorten a really long return type for what's now called "Stack<B>".

Consider either just renaming BoundService<B> to Service<B> and removing the type alias, or changing all consumers of this API to refer to BoundService<B> (which seems clearer IMHO) and removing the type alias.

seanmonstar (Contributor, Author):

Actually, on trying this out, I found that other places using types from this module preferred the prefix: bind::Protocol, bind::HttpRequest, etc. So seeing bind::Service actually felt clear.
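
A toy illustration of that naming argument, with stand-in types; at call sites the module prefix does the qualifying, so the short alias reads naturally next to the other `bind::` types:

    mod bind {
        pub struct Protocol;
        pub struct HttpRequest;
        pub struct BoundService<B>(pub B);
        pub type Service<B> = BoundService<B>;
    }

    // `bind::Service<B>` reads as clearly as `bind::BoundService<B>` would.
    fn route<B>(_svc: bind::Service<B>, _proto: &bind::Protocol, _req: bind::HttpRequest) {}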

///
/// # TODO
///
/// Buffering is currently unbounded and does not apply timeouts. This must be
olix0r (Member):

Thanks for removing this out of date comment! 👍


// whoever owns this service will call `poll_ready` if they
// are still interested.
task::current().notify();
Ok(Async::NotReady)
olix0r (Member):

What are the merits of doing this versus looping with the new state?

seanmonstar (Contributor, Author):

I tried to make that clear in the comment, so it seems I should add to it.

If we loop in here, then we will eagerly set up a new connection, even if the buffer wrapping this service determines that its queue of requests has run out. This instead allows the buffer to determine whether it should loop or not.
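
Roughly, the option taken here versus the rejected one; `rebind` is a stand-in for the actual rebinding step:

    extern crate futures;

    use futures::{task, Async, Poll};

    fn rebind() { /* stand-in: bind a fresh inner service */ }

    /// What this PR does: rebind, then yield. The owning buffer calls
    /// `poll_ready` again only if it still has queued requests, so no
    /// connection is opened eagerly. Looping here instead would poll the
    /// fresh service immediately and start connecting even if the buffer's
    /// queue had already drained.
    fn poll_ready_after_connect_error() -> Poll<(), ()> {
        rebind();
        task::current().notify();
        Ok(Async::NotReady)
    }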

olix0r (Member):

thanks.

olix0r (Member):

I think that first sentence in the comment just needs to be rearranged a bit

// This service isn't ready yet. Instead of trying to make it ready,
// schedule the task for notification so that the caller can
// determine whether readiness is still necessary (i.e. whether
// there are still requests to be sent).

... or something like that...

olix0r (Member) left a comment:

looks good to me!

Instead of having connect errors destroy all buffered requests,
this changes Bind to return a service that can rebind itself when
there is a connect error.

It won't try to establish the new connection itself, but waits for
the buffer to poll again. Combining this with changes in tower-buffer
to remove canceled requests from the buffer should mean that we
won't loop on connect errors forever.

Signed-off-by: Sean McArthur <sean@seanmonstar.com>
seanmonstar merged commit fb904f0 into master on May 17, 2018
seanmonstar deleted the proxy-buffer-closed branch on May 17, 2018 21:18
khappucino pushed a commit to Nordstrom/linkerd2 that referenced this pull request Mar 5, 2019
hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Aug 24, 2020
The proxy's integration tests depend on the `net2` crate, which has been
deprecated and replaced by `socket2`. Since `net2` is no longer actively
maintained, `cargo audit` will warn us about it, so we should replace it
with `socket2`.

While I was making this change, I was curious why we were manually
constructing and binding these sockets at all, rather than just using
`tokio::net::TcpListener::bind`. After some archaeology, I determined
that this was added in linkerd/linkerd2#952, which added a test that
requires a delay between when a socket is _bound_ and when it starts
_listening_. `tokio::net::TcpListener::bind` (as well as the `std::net`
version) perform these operations together. Since this wasn't obvious
from the test code, I went ahead and moved the new `socket2` version of
this into a pair of functions, with comments explaining why we didn't
just use `tokio::net`.

Fixes linkerd/linkerd2#4891
hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Aug 24, 2020
panthervis added a commit to panthervis/linkerd2-proxy that referenced this pull request Oct 8, 2021