Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load balancers fall back to ORIG_DST when no endpoints exist #248

Merged
merged 16 commits into from
May 15, 2019

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented May 8, 2019

Currently, when no endpoints exist in the load balancer for a
destination, we fail the request. This is because we expect endpoints to
be discovered by both destination service queries and DNS lookups, so
if there are no endpoints for a destination, it is assumed to not exist.

In linkerd/linkerd2#2661, we intend to remove the DNS lookup from the
proxy and instead fall back to routing requests for which no endpoints
exist in the destination service to their SO_ORIGINAL_DST IP address.
This means that the current approach of failing requests when the load
balancer has no endpoints will no longer work.

This branch introduces a generic fallback layer, which composes a
primary and secondary service builder into a new layer. The primary
service can fail requests with an error type that propages the original
request, allowing the fallback middleware to call the fallback service
with the same request. Other errors returned by the primary service are
still propagated upstream.

In contrast to the approach used in #240, this fallback middleware is
generic and not tied directly to a load balancer or a router, and can
be used for other purposes in the future. It relies on the router cache
eviction added in #247 to drain the router when it is not being used,
rather than proactively destroying the router when endpoints are
available for the lb, and re-creating it when they exist again.

A new trait, HasEndpointStatus, is added in order to allow the
discovery lookup to communicate the "no endpoints" state to the
balancer. In addition, we add a new Update::NoEndpoints variant to
proxy::resolve::Update, so that when the control plane sends a no
endpoints update, we switch from the balancer to the no endpoints state
immediately, rather than waiting for all the endpoints to be
individually removed. When the balancer has no endpoints, it fails all
requests with a fallback error, so that the fallback middleware

A future commit will remove the DNS lookups from the discovery module.

Closes #240.

Signed-off-by: Eliza Weisman eliza@buoyant.io

@hawkw hawkw requested review from olix0r and kleimkuhler May 8, 2019 22:31
@hawkw hawkw self-assigned this May 8, 2019
src/app/main.rs Outdated Show resolved Hide resolved
impl<Req, Rec> Clone for Layer<Req, Rec>
where
Rec: Recognize<Req> + Clone + Send + Sync + 'static,
{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous derived Clone implementation for Layer (and for Stack) required that the request type also be Clone, even though it's a PhantomData; this had to be fixed since the fallback layer requires cloning the fallback service.

src/app/main.rs Outdated Show resolved Hide resolved
src/proxy/http/fallback.rs Outdated Show resolved Hide resolved
src/proxy/http/fallback.rs Show resolved Hide resolved
src/proxy/http/fallback.rs Show resolved Hide resolved
src/proxy/http/fallback.rs Outdated Show resolved Hide resolved
src/proxy/http/fallback.rs Show resolved Hide resolved
src/proxy/http/fallback.rs Show resolved Hide resolved
src/proxy/http/fallback.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I also found the type Layer<...> in fallback.rs to be unclear but it's not a blocker for me.

When we ready Ready for MakeFallback still needs to be figured out. I think I prefer the way the way it currently is -- applying backpressure when making the service instead of falling back and having the possibility that the fallback service is not ready.

src/app/main.rs Outdated Show resolved Hide resolved
src/proxy/http/fallback.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member Author

hawkw commented May 14, 2019

@olix0r and @kleimkuhler, I've made all the changes you requested. I think we still need to agree on what the correct thing to do when the fallback service is not ready is.

hawkw added a commit that referenced this pull request May 14, 2019
Currently, router `layer`s are constructed with a single argument, a
type implementing `Recognize`. Then, the entire router stack is built
with a `router::Config`. However, in #248, it became necessary to
provide the config up front when constructing the `router::layer`, as
the layer is used in a fallback layer. Rather than providing a separate
type for a preconfigured layer, @olix0r suggested we simply change all
router layers to accept the `Config` when they're constructed (see
#248 (comment)).

This branch changes `router::Layer` to accept the config up front. The
`router::Stack` types `make` function now requires no arguments, and the
implementation of `Service` for `Stack` can be called with any `T` (as
the target is now ignored).

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added 13 commits May 14, 2019 14:40
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added 3 commits May 15, 2019 11:16
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw hawkw requested a review from olix0r May 15, 2019 18:45
Copy link
Member

@olix0r olix0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity, here's how I tested this:

  1. deployed strest with a proxy from this branch
  2. confirmed traffic to the dst pod
  3. change the service so that no pods are selected in the service
  4. confirm that traffic continues to the pod despite the service being emptied
  5. add a new client (that can't resolve the service name to any ips) and confirm that it cannot send requests

@hawkw hawkw merged commit b70c68d into master May 15, 2019
hawkw added a commit that referenced this pull request May 15, 2019
The `tower-balance` load balancer's `poll_ready` function returns
`NotReady` when the load balancer has no ready endpoints. However, our
fallback strategy added in #248 requires the load balancer to accept a
request when it has no endpoints, so that it can return the error
indicating that the fallback router should be used. This branch changes
the proxy's `http::balance::Service`'s `poll_ready` implementation to
return `Ok(Async::Ready(()))` when it is in the no endpoints state, so
that the request will fall back.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
olix0r added a commit to linkerd/linkerd2 that referenced this pull request May 16, 2019
commit b27dfb2d21aa8ca5466ea0edce17d27094ace7c1
Author: Takanori Ishibashi <takanori.1112@gmail.com>
Date:   Wed May 15 05:58:42 2019 +0900

    updaes->updates (#250)

    Signed-off-by: Takanori Ishibashi <takanori.1112@gmail.com>

commit 16441c25a9d423a6ab12b689b830d9ae3798fa00
Author: Eliza Weisman <eliza@buoyant.io>
Date:   Tue May 14 14:40:03 2019 -0700

     Pass router::Config directly to router::Layer (#253)

    Currently, router `layer`s are constructed with a single argument, a
    type implementing `Recognize`. Then, the entire router stack is built
    with a `router::Config`. However, in #248, it became necessary to
    provide the config up front when constructing the `router::layer`, as
    the layer is used in a fallback layer. Rather than providing a separate
    type for a preconfigured layer, @olix0r suggested we simply change all
    router layers to accept the `Config` when they're constructed (see
    linkerd/linkerd2-proxy#248 (comment)).

    This branch changes `router::Layer` to accept the config up front. The
    `router::Stack` types `make` function now requires no arguments, and the
    implementation of `Service` for `Stack` can be called with any `T` (as
    the target is now ignored).

    Signed-off-by: Eliza Weisman <eliza@buoyant.io>

commit b70c68d4504a362eac6a7828039a2e5c7fcd308a
Author: Eliza Weisman <eliza@buoyant.io>
Date:   Wed May 15 13:14:04 2019 -0700

    Load balancers fall back to ORIG_DST when no endpoints exist (#248)

    Currently, when no endpoints exist in the load balancer for a
    destination, we fail the request. This is because we expect endpoints to
    be discovered by both destination service queries _and_ DNS lookups, so
    if there are no endpoints for a destination, it is assumed to not exist.

    In #2661, we intend to remove the DNS lookup from the
    proxy and instead fall back to routing requests for which no endpoints
    exist in the destination service to their SO_ORIGINAL_DST IP address.
    This means that the current approach of failing requests when the load
    balancer has no endpoints will no longer work.

    This branch introduces a generic `fallback` layer, which composes a
    primary and secondary service builder into a new layer. The primary
    service can fail requests with an error type that propages the original
    request, allowing the fallback middleware to call the fallback service
    with the same request. Other errors returned by the primary service are
    still propagated upstream.

    In contrast to the approach used in #240, this fallback middleware is
    generic and not tied directly to a load balancer or a router, and can
    be used for other purposes in the future. It relies on the router cache
    eviction added in #247 to drain the router when it is not being used,
    rather than proactively destroying the router when endpoints are
    available for the lb, and re-creating it when they exist again.

    A new trait, `HasEndpointStatus`, is added in order to allow the
    discovery lookup to communicate the "no endpoints" state to the
    balancer. In addition, we add a new `Update::NoEndpoints` variant to
    `proxy::resolve::Update`, so that when the control plane sends a no
    endpoints update, we switch from the balancer to the no endpoints state
    _immediately_, rather than waiting for all the endpoints to be
    individually removed. When the balancer has no endpoints, it fails all
    requests with a fallback error, so that the fallback middleware

    A subsequent PR (#248) will remove the DNS lookups from the discovery
    module.

    Closes #240.

    Signed-off-by: Eliza Weisman <eliza@buoyant.io>

commit 6525b0638ad18e74510f3156269e0613f237e2f5
Author: Zahari Dichev <zaharidichev@gmail.com>
Date:   Wed May 15 23:35:09 2019 +0300

    Allow disabling tap by setting an env var (#252)

    This PR fixes #2811. Now if
    `LINKERD2_PROXY_TAP_DISABLED` is set, the tap is not served at all. The
    approach taken is that  the `ProxyParts` is changed so the
    `control_listener` is now an `Option` that will be None if tap is
    disabled as this control_listener seems to be exclusively used to serve
    the tap. Feel free to suggest a better approach.

    Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>

commit 91f32db2ea6d74470fd689c713ff87dc7586222d
Author: Zahari Dichev <zaharidichev@gmail.com>
Date:   Thu May 16 00:45:23 2019 +0300

    Assert that outbound TLS works before identity is certified (#251)

    This commit introduces TLS capabilities to the support server as well as
    tests to ensure that outbound TLS works even when there is no verified
    certificate for the proxy yet.

    Fixes #2599

    Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>

commit 45aadc6b1b28e6daea0c40e694a86ae518887d85
Author: Sean McArthur <sean@buoyant.io>
Date:   Wed May 15 14:25:39 2019 -0700

    Update h2 to v0.1.19

    Includes a couple HPACK fixes

    Signed-off-by: Sean McArthur <sean@buoyant.io>

commit 3e0e00c6dfbf5a9155b887cfd594f611edfc135f
Author: Oliver Gould <ver@buoyant.io>
Date:   Thu May 16 08:11:06 2019 -0700

    Update mio to 0.6.17 (#257)

    To pick up tokio-rs/mio#939
olix0r added a commit to linkerd/linkerd2 that referenced this pull request May 16, 2019
commit b27dfb2d21aa8ca5466ea0edce17d27094ace7c1
Author: Takanori Ishibashi <takanori.1112@gmail.com>
Date:   Wed May 15 05:58:42 2019 +0900

    updaes->updates (#250)

    Signed-off-by: Takanori Ishibashi <takanori.1112@gmail.com>

commit 16441c25a9d423a6ab12b689b830d9ae3798fa00
Author: Eliza Weisman <eliza@buoyant.io>
Date:   Tue May 14 14:40:03 2019 -0700

     Pass router::Config directly to router::Layer (#253)

    Currently, router `layer`s are constructed with a single argument, a
    type implementing `Recognize`. Then, the entire router stack is built
    with a `router::Config`. However, in #248, it became necessary to
    provide the config up front when constructing the `router::layer`, as
    the layer is used in a fallback layer. Rather than providing a separate
    type for a preconfigured layer, @olix0r suggested we simply change all
    router layers to accept the `Config` when they're constructed (see
    linkerd/linkerd2-proxy#248 (comment)).

    This branch changes `router::Layer` to accept the config up front. The
    `router::Stack` types `make` function now requires no arguments, and the
    implementation of `Service` for `Stack` can be called with any `T` (as
    the target is now ignored).

    Signed-off-by: Eliza Weisman <eliza@buoyant.io>

commit b70c68d4504a362eac6a7828039a2e5c7fcd308a
Author: Eliza Weisman <eliza@buoyant.io>
Date:   Wed May 15 13:14:04 2019 -0700

    Load balancers fall back to ORIG_DST when no endpoints exist (#248)

    Currently, when no endpoints exist in the load balancer for a
    destination, we fail the request. This is because we expect endpoints to
    be discovered by both destination service queries _and_ DNS lookups, so
    if there are no endpoints for a destination, it is assumed to not exist.

    In #2661, we intend to remove the DNS lookup from the
    proxy and instead fall back to routing requests for which no endpoints
    exist in the destination service to their SO_ORIGINAL_DST IP address.
    This means that the current approach of failing requests when the load
    balancer has no endpoints will no longer work.

    This branch introduces a generic `fallback` layer, which composes a
    primary and secondary service builder into a new layer. The primary
    service can fail requests with an error type that propages the original
    request, allowing the fallback middleware to call the fallback service
    with the same request. Other errors returned by the primary service are
    still propagated upstream.

    In contrast to the approach used in #240, this fallback middleware is
    generic and not tied directly to a load balancer or a router, and can
    be used for other purposes in the future. It relies on the router cache
    eviction added in #247 to drain the router when it is not being used,
    rather than proactively destroying the router when endpoints are
    available for the lb, and re-creating it when they exist again.

    A new trait, `HasEndpointStatus`, is added in order to allow the
    discovery lookup to communicate the "no endpoints" state to the
    balancer. In addition, we add a new `Update::NoEndpoints` variant to
    `proxy::resolve::Update`, so that when the control plane sends a no
    endpoints update, we switch from the balancer to the no endpoints state
    _immediately_, rather than waiting for all the endpoints to be
    individually removed. When the balancer has no endpoints, it fails all
    requests with a fallback error, so that the fallback middleware

    A subsequent PR (#248) will remove the DNS lookups from the discovery
    module.

    Closes #240.

    Signed-off-by: Eliza Weisman <eliza@buoyant.io>

commit 6525b0638ad18e74510f3156269e0613f237e2f5
Author: Zahari Dichev <zaharidichev@gmail.com>
Date:   Wed May 15 23:35:09 2019 +0300

    Allow disabling tap by setting an env var (#252)

    This PR fixes #2811. Now if
    `LINKERD2_PROXY_TAP_DISABLED` is set, the tap is not served at all. The
    approach taken is that  the `ProxyParts` is changed so the
    `control_listener` is now an `Option` that will be None if tap is
    disabled as this control_listener seems to be exclusively used to serve
    the tap. Feel free to suggest a better approach.

    Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>

commit 91f32db2ea6d74470fd689c713ff87dc7586222d
Author: Zahari Dichev <zaharidichev@gmail.com>
Date:   Thu May 16 00:45:23 2019 +0300

    Assert that outbound TLS works before identity is certified (#251)

    This commit introduces TLS capabilities to the support server as well as
    tests to ensure that outbound TLS works even when there is no verified
    certificate for the proxy yet.

    Fixes #2599

    Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>

commit 45aadc6b1b28e6daea0c40e694a86ae518887d85
Author: Sean McArthur <sean@buoyant.io>
Date:   Wed May 15 14:25:39 2019 -0700

    Update h2 to v0.1.19

    Includes a couple HPACK fixes

    Signed-off-by: Sean McArthur <sean@buoyant.io>

commit 3e0e00c6dfbf5a9155b887cfd594f611edfc135f
Author: Oliver Gould <ver@buoyant.io>
Date:   Thu May 16 08:11:06 2019 -0700

    Update mio to 0.6.17 (#257)

    To pick up tokio-rs/mio#939
hawkw added a commit that referenced this pull request May 23, 2019
This branch removes the fall-back-to-DNS behaviour from the proxy's
destination module. Instead, when the Destination service returns "no
endpoints", the proxy will route the request to its original
destination, using the fallback behaviour introduced in #248.

In addition, the proxy now no longer reconnects to the destination
service for queries which receive a response with the Invalid Argument
status code. After linkerd/linkerd2#2664, this status is used by the
control plane to indicate names which should not query the destination
service. I've added a new test for this.

Furthermore (and unlike #245), this branch also removes the caching of
service discovery lookups so that they can be shared between resolvers
for authorities with the same canonical form. When we limited the number
of concurrently in flight destination service queries to 100, reusing
them was much more important; now it adds a lot of unnecessary
complexity.

The rewritten `control::destination` module is now much simpler and
(hopefully) easier to understand. The `Resolver` holds a destination
service `Client`, which is cloned every time a new `Resolution` is
requested. Each active `Resolution` has a corresponding background task
that drives the destination service query and sends updates through a
channel to the `Resolver`. 

The background task is necessary so that updates are processed as they
are recieved, rather than only when the `Resolution` is polled (i.e.
when the load balancer is routing a request). In addition, the channel
provides the queueing necessary to transform the control plane's update
types (which contain multiple endpoint addresses) into multiple
`tower-balance` updates for single endpoints, and to invalidate stale
endpoints when the control plane connection is reestablished. When a
control plane query is not necessary because the requested authority
does not match the proxy's destination suffixes, the background task is
not spawned and the `Resolution` remains permanently in the "no
endpoints" state; and when an `InvalidArgument` error code is recieved,
the background task terminates.

A majority of the service discovery tests pass with no changes after
this change. The following changes were necessary: 
* The test `outbound_updates_newer_services` was removed, as it tests 
  that the same destination service query is shared between multiple
  services with the same authority (which is no longer the case).
* The mock destination service was changed to send the `Unavailable`
  status code when it has run out of expected destinations, rather than
  `InvalidArgument`, as that code is used to indicate that a destination
  should never be reconnected. 
* New tests were added for falling back to ORIG_DST when no endpoints 
  exist, and for not reconnecting after an `InvalidArgument` error.

Closes #245.
Closes linkerd/linkerd2#2661
Closes linkerd/linkerd2#2728

Signed-off-by: Eliza Weisman eliza@buoyant.io
@olix0r olix0r deleted the liza/fallback branch August 17, 2019 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants