Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loadbalance requests to control plane #594

Merged
merged 30 commits into from
Aug 26, 2020
Merged

Conversation

zaharidichev
Copy link
Member

@zaharidichev zaharidichev commented Jul 10, 2020

This PR enables the discovery of control plane components through DNS. This is done by implementing a resolution stream backed by a DNS resolver. The stream will yield Updates that feed the loadbalancer.

The stream is written with async-stream to avoid manual state machines. In addition to that a dns::Resolver trait is introduced to allow for easily mocking a DNS resolver and testing the stream implementation.

I have tested this and can confirm that it resolves the problems described in linkerd/linkerd2#4674 and linkerd/linkerd2#4624

Signed-off-by: Zahari Dichev zaharidichev@gmail.com

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
@zaharidichev zaharidichev marked this pull request as ready for review July 13, 2020 14:08
@zaharidichev zaharidichev requested a review from a team July 13, 2020 14:08
@@ -55,7 +69,8 @@ impl Config {
)))
.push_timeout(control.connect.timeout)
.push(control::client::layer())
.push(control::resolve::layer(dns))
.push(discover)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we abstract all of that for identity, dst and the oc collector. If so, what would a method that produces this stack return? impl MakeService ? I had some trouble with the typo contraints

Comment on lines 106 to 107
const EWMA_DEFAULT_RTT: Duration = Duration::from_millis(30);
const EWMA_DECAY: Duration = Duration::from_secs(10);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are sensible values for all these constants and magix numbers when it comes to communicating with the control plane?

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall approach looks good. I had some notes on the implementation.

linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
@zaharidichev zaharidichev changed the title Load balancer requests to control plane Load balance requests to control plane Jul 13, 2020
@zaharidichev zaharidichev changed the title Load balance requests to control plane Loadbalance requests to control plane Jul 13, 2020
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this definitely feels like the right approach. I had a few more suggestions.

linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
@zaharidichev zaharidichev requested review from hawkw and a team July 21, 2020 08:24
Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking really good! Walking through the tests a little more, but just had some smaller comments to add.

linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/Cargo.toml Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/app/core/src/control.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
linkerd/dns/src/lib.rs Outdated Show resolved Hide resolved
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
@olix0r
Copy link
Member

olix0r commented Aug 25, 2020

Are any control plane changes needed to test this? I seem to hit errors like:

linkerd-web-766dd9bb65-mn7zm linkerd-proxy [     0.12560880s]  WARN ThreadId(13) dst: linkerd2_reconnect::service: Service failed error=load balancer discovery error: Invalid SRV record SRV { priority: 0, weight: 100, port: 8086, target: Name { is_fqdn: true, labels: [linkerd-dst, linkerd, svc, cluster, local] } }

Are we going to have to support falling back kube-proxy?

@zaharidichev
Copy link
Member Author

zaharidichev commented Aug 25, 2020

@olix0r yes you need to make identitiy and dst headless. Here is the branch: linkerd/linkerd2@17e6490

Do we want to fall back on kube-proxy. If so, how would we do that? Just do an Ip lookup in case we error out?

* dns-resolve: Always use resets

There's no reason that we have to maintain the resolution state now that
we have a Reset type.

Furthermore, we can use a unit endpoint type, since it is ignored.

* undo errant change

* Use map_endpoint to build control client targets

* undo unnecessarty change

* lookup_service_ips => lookup_service_addrs

* dns: Run DNS resolutions on the main runtime

DNS resolutions are run on the admin runtime. This requires an
unnecessary layer of indirection around the resolver, including an MPSC.

Now that we allow the main runtime to use more than one thread, it's
preferable to do this discovery on the main runtime and we can simplify
the implementation.

* Remove needless recursion_limit settings

* Set span on background task

* Fallback from SRV records to A records when SRV records are invalid

* Fallback on each call, not only the first

* -async_stream; not pin, etc

* touchup

* instrument dns resolver

* trace log on close

* -boilerplate

* -recursion_limit
@olix0r olix0r requested a review from hawkw August 26, 2020 00:24
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much simpler now! this lgtm!

Comment on lines +167 to +169
Service = discover::MakeEndpoint<
discover::FromResolve<map_endpoint::Resolve<IntoTarget, DnsResolve>, Target>,
M,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker but could we maybe make this thing a type alias or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's only used the once, isn't it? so, yes, but it wouldn't really reduce anything

@olix0r olix0r merged commit 6962794 into main Aug 26, 2020
@olix0r olix0r deleted the zd/control-plane-discover branch August 26, 2020 00:58
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Aug 26, 2020
This release includes internal changes to the service discovery system,
especially when discovering control plane components (like the
destination and identity controllers). Now, the proxy attempts to
balance requests across all pods in each control plane service. This
requires control plane changes to use "headless" services so that SRV
records are exposed. When the control plane services have a `clusterIP`
set, the proxy falls back to using normal A-record lookups.

---

* tracing: add richer verbose spans to http clients (linkerd/linkerd2-proxy#622)
* trace: update tracing dependencies (linkerd/linkerd2-proxy#623)
* Remove `Resolution` trait (linkerd/linkerd2-proxy#606)
* Update proxy-identity to edge-20.8.2 (linkerd/linkerd2-proxy#627)
* Add build arg for skipping identity wrapper (linkerd/linkerd2-proxy#624)
* Wait for proxy thread to terminate in integration tests (linkerd/linkerd2-proxy#625)
* Remove scrubbing for unused headers (linkerd/linkerd2-proxy#628)
* Split orig-proto tests out of discovery tests (linkerd/linkerd2-proxy#629)
* Re-enable outbound timeout test (linkerd/linkerd2-proxy#630)
* profiles: perform profile resolution for IP addresses (linkerd/linkerd2-proxy#626)
* Move resolve api to async-stream (linkerd/linkerd2-proxy#599)
* Decouple discovery buffering from endpoint conversion (linkerd/linkerd2-proxy#631)
* resolve: Add a Reset state (linkerd/linkerd2-proxy#633)
* resolve: Eagerly fail resolutions (linkerd/linkerd2-proxy#634)
* test: replace `net2` dependency with `socket2` (linkerd/linkerd2-proxy#635)
* dns: Run DNS resolutions on the main runtime (linkerd/linkerd2-proxy#637)
* Load balance requests to the control plane (linkerd/linkerd2-proxy#594)
* Unify control plane client construction (linkerd/linkerd2-proxy#638)
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Aug 26, 2020
This release includes internal changes to the service discovery system,
especially when discovering control plane components (like the
destination and identity controllers). Now, the proxy attempts to
balance requests across all pods in each control plane service. This
requires control plane changes to use "headless" services so that SRV
records are exposed. When the control plane services have a `clusterIP`
set, the proxy falls back to using normal A-record lookups.

---

* tracing: add richer verbose spans to http clients (linkerd/linkerd2-proxy#622)
* trace: update tracing dependencies (linkerd/linkerd2-proxy#623)
* Remove `Resolution` trait (linkerd/linkerd2-proxy#606)
* Update proxy-identity to edge-20.8.2 (linkerd/linkerd2-proxy#627)
* Add build arg for skipping identity wrapper (linkerd/linkerd2-proxy#624)
* Wait for proxy thread to terminate in integration tests (linkerd/linkerd2-proxy#625)
* Remove scrubbing for unused headers (linkerd/linkerd2-proxy#628)
* Split orig-proto tests out of discovery tests (linkerd/linkerd2-proxy#629)
* Re-enable outbound timeout test (linkerd/linkerd2-proxy#630)
* profiles: perform profile resolution for IP addresses (linkerd/linkerd2-proxy#626)
* Move resolve api to async-stream (linkerd/linkerd2-proxy#599)
* Decouple discovery buffering from endpoint conversion (linkerd/linkerd2-proxy#631)
* resolve: Add a Reset state (linkerd/linkerd2-proxy#633)
* resolve: Eagerly fail resolutions (linkerd/linkerd2-proxy#634)
* test: replace `net2` dependency with `socket2` (linkerd/linkerd2-proxy#635)
* dns: Run DNS resolutions on the main runtime (linkerd/linkerd2-proxy#637)
* Load balance requests to the control plane (linkerd/linkerd2-proxy#594)
* Unify control plane client construction (linkerd/linkerd2-proxy#638)
@lieberlois
Copy link

Any chance that this is not fixed in non-HA mode? We had this behaviour recently after an AKS node crash, had 2 Linkerd Destination Pods, one of which was on that node 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants