Eagerly drop services from the router cache #247
Conversation
Force-pushed from 492c1f0 to 8fc747b
This is definitely on the right track --- most of my feedback is style-related.
@hawkw As we discussed, your comments led me to realize that some of these tests should have been panicking when they were not. This was because they were not being run in a tokio runtime. I've fixed them all to do that, and switched the ones that need to block to all use the same blocking setup.
Currently, when no endpoints exist in the load balancer for a destination, we fail the request. This is because we expect endpoints to be discovered by both destination service queries _and_ DNS lookups, so if there are no endpoints for a destination, it is assumed not to exist. In linkerd/linkerd2#2661, we intend to remove the DNS lookup from the proxy and instead fall back to routing requests for which no endpoints exist in the destination service to their SO_ORIGINAL_DST IP address. This means that the current approach of failing requests when the load balancer has no endpoints will no longer work.

This branch introduces a generic `fallback` layer, which composes a primary and secondary service builder into a new layer. The primary service can fail requests with an error type that propagates the original request, allowing the fallback middleware to call the fallback service with the same request. Other errors returned by the primary service are still propagated upstream. In contrast to the approach used in #240, this fallback middleware is generic and not tied directly to a load balancer or a router, and can be used for other purposes in the future. It relies on the router cache eviction added in #247 to drain the router when it is not being used, rather than proactively destroying the router when endpoints are available for the lb and re-creating it when they exist again.

A new trait, `HasEndpointStatus`, is added in order to allow the discovery lookup to communicate the "no endpoints" state to the balancer. In addition, we add a new `Update::NoEndpoints` variant to `proxy::resolve::Update`, so that when the control plane sends a no endpoints update, we switch from the balancer to the no endpoints state _immediately_, rather than waiting for all the endpoints to be individually removed. When the balancer has no endpoints, it fails all requests with a fallback error, so that the fallback middleware can call the fallback service instead.

A subsequent PR (#248) will remove the DNS lookups from the discovery module.

Closes #240.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
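As a rough illustration of that contract, here is a minimal, synchronous sketch (hypothetical names throughout: `PrimaryError`, `SimpleService`, `Fallback`; the real middleware is an asynchronous tower layer, not this simplified trait):

```rust
/// Hypothetical error type for the primary service: either hand the
/// original request back for the fallback path, or report a real error.
enum PrimaryError<Req, E> {
    /// No endpoints were available; the request is carried out in the
    /// error so the middleware can replay it against the fallback.
    Fallback(Req),
    /// Any other failure is propagated upstream unchanged.
    Other(E),
}

/// A stand-in for an async `tower::Service`, kept synchronous for brevity.
trait SimpleService<Req> {
    type Response;
    type Error;
    fn call(&mut self, req: Req) -> Result<Self::Response, Self::Error>;
}

/// Composes the primary and fallback services.
struct Fallback<P, S> {
    primary: P,
    secondary: S,
}

impl<Req, P, S> SimpleService<Req> for Fallback<P, S>
where
    S: SimpleService<Req>,
    P: SimpleService<Req, Response = S::Response, Error = PrimaryError<Req, S::Error>>,
{
    type Response = S::Response;
    type Error = S::Error;

    fn call(&mut self, req: Req) -> Result<Self::Response, Self::Error> {
        match self.primary.call(req) {
            Ok(rsp) => Ok(rsp),
            // Recover the original request and replay it on the fallback.
            Err(PrimaryError::Fallback(request)) => self.secondary.call(request),
            // Other errors still propagate upstream.
            Err(PrimaryError::Other(e)) => Err(e),
        }
    }
}
```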
Force-pushed from 65c7128 to 99430b4
Force-pushed from 99430b4 to 1ffd539
Force-pushed from 0a8e67c to 124a72a
In addressing @seanmonstar's comment, the core difference is that locking the router cache has moved into the background purge task.
In summary, the router cache lock acquisition has been moved into the tokio runtime, which should result in fewer spins and less lock contention (when we someday default to a multithreaded runtime).
Here's a preliminary pass. I haven't taken a thorough look at all of this code yet, but I wanted to leave the comments that I've got so far.
Also, be sure to merge master into this when you get a chance. It merges cleanly.
Force-pushed from 2646f26 to 7839264
This is looking pretty good! Mostly focusing on docs & clarity, now... I think we're close to merging this. Note that you'll need to merge master, though!
lib/router/src/cache.rs (outdated)
///
/// ## Complexity
/// Complexity:
/// - `access` in **O(1)** time (amortized average)
...and here.
(I don't think RustDoc will format this correctly; IIRC anything indented four spaces is formatted as code by the markdown parser)
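For illustration, a doc comment following that advice might look like the following (assumed wording; only the `access` bullet appears in the diff above, and the `reserve` line is an assumption based on the handles this PR describes):

```rust
/// A cache of routed services.
///
/// # Complexity
///
/// - `access` in **O(1)** time (amortized average)
/// - `reserve` in **O(1)** time (amortized average)
///
/// Keeping each bullet flush against the `///` (rather than indenting it
/// four spaces) ensures rustdoc's markdown renders a list, not a code block.
pub struct Cache;
```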
I'm trying to devise a manual test plan for this... Essentially, we need to have a client that sends a request to a service and then sleeps for twice the max idle age before repeating the process. We'd want to manually confirm (probably via logs, for now?) that we see the service dropped and re-inserted upon the next request... This also raises some follow-up issues we should open.
It looks like your most recent build failed due to rustfmt errors.
This looks good to me, provided we can manually verify that it's working correctly. Commented on some non-blocking nits.
{
    /// A router that is created with a fixed set of known routes.
    ///
    /// `recognize` will only produce targets that exist in `routes`. We never
Nit: I might change this to say

- /// `recognize` will only produce targets that exist in `routes`. We never
+ /// `recognize` is expected to only produce targets that exist in `routes`. We never

we can't actually enforce this; it's a contract that we expect the caller to adhere to.
/// A router that is created with a fixed set of known routes.
///
/// `recognize` will only produce targets that exist in `routes`. We never
/// want to expire routes from the fixed set; the explicit drop of the
Nit/TIOLI: I might also rephrase this to focus on the behaviour of the fixed router --- something like
"Because the set of routes will not change, routes will not be expired from the router cache, and no background task is necessary to purge it." --- I think Oliver's suggestion that we comment about the explicit drop of the purge task is more of an implementation detail that I'd put in the function body.
This is a nit & non-blocking, I think the comment is okay as-is.
@olix0r @hawkw Above you will see the change I had to make to ensure the cache purge task continues to poll in the background. I confirmed services are being purged from the router cache as we would expect with this change. I set up an environment based on slow-cooker. There is a client that makes a single curl request every 4 seconds, and I set the router's max idle age to less than that interval. The following abbreviated logs show the repeated scenario:
Here is the gist with the deployment I used. To replicate you'll need to:
After that you can …
I also have this proxy image continuing to run in this purge deployment on GKE.
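The verification loop described above is essentially the following (a hypothetical stand-in for the curl client; the 4-second interval comes from the comment above, and the real requests were plain curl against the deployed service):

```rust
use std::{thread, time::Duration};

fn main() {
    // Per the comment above: one request every 4 seconds, with the
    // router's max idle age set below that interval so the service is
    // purged (and re-inserted) between consecutive requests.
    let request_interval = Duration::from_secs(4);
    loop {
        // Stand-in for the single curl request against the test service.
        println!("sending request; expecting the service to be re-inserted");
        thread::sleep(request_interval);
    }
}
```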
self.recognize.clone(),
self.inner.clone(),
self.config.capacity,
self.config.max_idle_age,
);

let ctx = logging::Section::Proxy.bg(self.config.proxy_name);
Nit: this log context will format like this:

proxy={bg=foo_router}

I think it might be better to wrap the `proxy_name` field in a type that formats like `"cache_purge router={}", proxy_name`, so that the resulting log lines look like

proxy={bg=cache_purge router=foo_router}

or something. This might be useful in case there are ever other background contexts associated with routers.

On the other hand, we can always change this later.
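A sketch of the suggested wrapper (the `CachePurge` type is hypothetical; the `logging::Section::Proxy.bg(...)` call above is the proxy's real API, but this wrapper is only illustrative):

```rust
use std::fmt;

/// Wraps the router's proxy name so the background log context renders
/// as `proxy={bg=cache_purge router=foo_router}` instead of
/// `proxy={bg=foo_router}`.
struct CachePurge<N>(N);

impl<N: fmt::Display> fmt::Display for CachePurge<N> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "cache_purge router={}", self.0)
    }
}

fn main() {
    // e.g. logging::Section::Proxy.bg(CachePurge(self.config.proxy_name))
    println!("proxy={{bg={}}}", CachePurge("foo_router"));
}
```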
There is definitely some follow-up around this change regarding metrics for the router; I will probably address this when I do that work. These logs would also be at the `trace` level, so I don't think it's too big an issue right now, as searching on the `trace` messages is fairly helpful.
I'm happy to merge this in its current state as soon as your manual testing confirms that everything is working correctly.
lgtm. I suggest that we move the particularly relevant logging up to `debug!` before merging, so we don't have to sift through the copious trace logging to get a signal about these somewhat rare events.
commit b27dfb2 (Takanori Ishibashi): updaes->updates (linkerd#250)

commit 16441c2 (Eliza Weisman): Pass router::Config directly to router::Layer (linkerd#253). Router `layer`s were constructed with a single argument, a type implementing `Recognize`, and the entire router stack was then built with a `router::Config`. In linkerd#248 it became necessary to provide the config up front when constructing the `router::layer`, as the layer is used in a fallback layer. Rather than providing a separate type for a preconfigured layer, @olix0r suggested we simply change all router layers to accept the `Config` when they're constructed (see linkerd#248 (comment)). `router::Layer` now accepts the config up front, the `router::Stack` type's `make` function requires no arguments, and the implementation of `Service` for `Stack` can be called with any `T` (as the target is now ignored).

commit b70c68d (Eliza Weisman): Load balancers fall back to ORIG_DST when no endpoints exist (linkerd#248); the full description is quoted earlier in this conversation. Closes linkerd#240.

commit 6525b06 (Zahari Dichev): Allow disabling tap by setting an env var (linkerd#252). Fixes linkerd/linkerd2#2811: if `LINKERD2_PROXY_TAP_DISABLED` is set, tap is not served at all. `ProxyParts` is changed so that `control_listener` is now an `Option` that is `None` when tap is disabled, as that listener is used exclusively to serve tap.

commit 91f32db (Zahari Dichev): Assert that outbound TLS works before identity is certified (linkerd#251). Introduces TLS capabilities in the support server as well as tests to ensure that outbound TLS works even when there is no verified certificate for the proxy yet. Fixes linkerd/linkerd2#2599.

commit 45aadc6 (Sean McArthur): Update h2 to v0.1.19; includes a couple of HPACK fixes.

commit 3e0e00c (Oliver Gould): Update mio to 0.6.17 (linkerd#257), to pick up tokio-rs/mio#939.
This change introduces an eviction task for router caches. Few changes were made in how a router uses a cache; we still use `access` and `reserve` handles for cache interactions.
Current issue
The issue this change addresses is that services are only dropped from a router cache when we have reached that cache's capacity. We drop a service when it has not received a request within the `max_idle_age`. When the cache is full, we scan through all the services and check their last access time. If a service's last access was longer ago than `max_idle_age`, then it is removed.
This strategy is problematic because a service is only removed once the capacity
has been reached. If the capacity of a cache is never reached, then a service's
stack is kept around for the lifetime of that router.
Proposed solution
When a router is created, we now also create a handle to that router's purge task (the `PurgeCache` struct). If a router should attempt to purge its services in the background, it can spawn that handle when the router is created.
A cache's state is now internally maintained by a `tokio_timer::DelayQueue`. When `PurgeCache` is able to pop services from the queue, we know that those services have not received a request within an `expires` span of time -- even if the cache has not reached its capacity.
An important thing to note about this strategy is that we still rely on locking the entire cache when holding an `access`/`reserve` handle, as well as now when purging it in the background. The purge task will only ever attempt to lock the cache with `try_lock!`; this ensures we do not cause lock contention with a more important `access`/`reserve` interaction.

Another important note is that `PurgeCache` only has a weak reference to the router cache. If the router is dropped and a purge task was previously spawned on the runtime, the task will not continue to try and purge cache values.
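These two properties (weak reference, opportunistic locking) can be sketched as follows (assumed shapes: the PR uses a `try_lock!` macro and its own cache type, while this sketch uses `std::sync` primitives and a hypothetical `Expire` trait):

```rust
use std::sync::{Mutex, Weak};

/// Handle held by the background purge task.
struct PurgeCache<C> {
    cache: Weak<Mutex<C>>,
}

/// Hypothetical view of the cache's eviction entry point.
trait Expire {
    fn expire_idle(&mut self);
}

impl<C: Expire> PurgeCache<C> {
    /// Returns `false` once the router (and thus the cache) is dropped,
    /// signalling the task to stop polling.
    fn poll_purge(&self) -> bool {
        let cache = match self.cache.upgrade() {
            Some(cache) => cache,
            None => return false, // the router is gone; stop purging
        };
        // Only try the lock: skip this pass entirely rather than block an
        // access/reserve holder that currently owns it.
        if let Ok(mut cache) = cache.try_lock() {
            cache.expire_idle();
        }
        true
    }
}
```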
Note: the implementation was originally sketched out here.

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>