Support custom dnsPolicy and dnsConfig #62

Closed
briansmith opened this issue Dec 18, 2017 · 34 comments

@briansmith
Contributor

For background on Kubernetes dnsPolicy and dnsConfig, see https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pods-dns-policy

For any pod managed by Conduit, the DNS configuration is completely bypassed. Assuming we want to continue to do this, we should:

  1. Document this behavior.
  2. In conduit inject, warn when a pod contains dnsConfig or a non-default dnsPolicy. (Note that the default dnsPolicy is not "Default"; the default is "ClusterFirst.")

We might also consider whether we want to actually honor the pod's DNS policy and/or the DNS config. This would probably require us to implement DNS in the proxy.
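As a concrete illustration, a pod spec like the following uses both a non-default dnsPolicy and a dnsConfig block (all names and addresses here are hypothetical, for illustration only); under the current behavior, these settings would be silently bypassed for meshed pods:

```yaml
# Illustrative pod spec; names and addresses are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  dnsPolicy: "None"          # non-default; the default is "ClusterFirst"
  dnsConfig:
    nameservers:
    - 10.0.0.53              # custom upstream resolver
    searches:
    - example.internal       # extra search domain
    options:
    - name: ndots
      value: "2"
  containers:
  - name: app
    image: example/app:latest
```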

@briansmith
Contributor Author

See also http://blog.kubernetes.io/2017/04/configuring-private-dns-zones-upstream-nameservers-kubernetes.html for some interesting details on this issue. We should consider allowing the controller Destination service to be configured with a ConfigMap like that.

@briansmith briansmith added this to the Conduit 0.3 milestone Feb 12, 2018
@briansmith briansmith modified the milestones: Conduit 0.3, Conduit 0.4 Feb 12, 2018
@briansmith briansmith changed the title Clarify that Conduit overrides pod's dnsPolicy and dnsConfig conduit inject should inject into pods that use incompatible dnsPolicy and dnsConfig Feb 17, 2018
@briansmith
Contributor Author

Based on the solution to #366, we should change course. Instead of documenting that Conduit could break pods that use an incompatible DNS configuration, conduit inject should, by default, avoid injecting the proxy into such pods. We should also provide a documented way to get the proxy working with them. For example, we could document that one has to remove the incompatible DNS settings, and/or we could implement and document an explicit pod annotation that overrides our default assumption that such DNS configurations are incompatible.
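One possible shape for such an opt-out annotation (the key and value below are purely hypothetical, not an implemented API):

```yaml
# Hypothetical annotation; "conduit.io/dns-override" is not an existing key.
metadata:
  annotations:
    conduit.io/dns-override: "unsafe-allow"  # opt in to injection despite custom DNS settings
```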

@briansmith briansmith changed the title conduit inject should inject into pods that use incompatible dnsPolicy and dnsConfig conduit inject should not inject into pods that use incompatible dnsPolicy and dnsConfig Feb 17, 2018
@briansmith briansmith modified the milestones: Conduit 0.4, Conduit 0.3 Feb 19, 2018
@briansmith briansmith self-assigned this Feb 19, 2018
@briansmith
Contributor Author

Based on the investigation in #392 and the current direction described in the conversation at the end of #360, I think it's best to do nothing for this in 0.3. Basically, if we spend the effort, we should be able to support any DNS policy transparently and automatically. It's unfortunate that 0.3 won't do that, but this isn't the most significant DNS transparency issue in 0.3, and we might fix the overall issue in 0.4.

@briansmith briansmith modified the milestones: Conduit 0.3, Conduit 0.4 Feb 20, 2018
@olix0r olix0r modified the milestones: Conduit 0.4, Conduit 0.3.1 Feb 21, 2018
@olix0r olix0r added the priority/P1 Planned for Release label Feb 23, 2018
@olix0r
Member

olix0r commented Feb 26, 2018

We should hold off on this until #421 is done.

@olix0r olix0r added the priority/P2 Nice-to-have for Release label Feb 26, 2018
@wmorgan
Member

wmorgan commented May 21, 2018

For completeness, the errors that @jakerobb reported on Slack were:

ERR! conduit_proxy::control "controller-client", controller error: Error attempting to establish underlying session layer: operation timed out after 3s
WARN trust_dns_proto::dns_handle error notifying wait, possible future leak: Err(ResolveError(Proto(Timeout), (None, stack backtrace:
   0:     0x562b82671e0c - backtrace::backtrace::trace::hef23eb1bdedb9de4
   1:     0x562b826710a2 - backtrace::capture::Backtrace::new::h570f24a892cd6a34
   2:     0x562b8263b917 - <trust_dns_proto::error::ProtoError as core::convert::From<trust_dns_proto::error::ProtoErrorKind>>::from::h043748def247281b
   3:     0x562b8262fec2 - _ZN110_$LT$trust_dns_proto..dns_handle..DnsFuture$LT$S$C$$u20$E$C$$u20$MF$GT$$u20$as$u20$futures..future..Future$GT$4poll17h37890051da49b999E.llvm.17885337389023634268
   4:     0x562b825ffbd0 - <futures::future::chain::Chain<A, B, C>>::poll::h046c15f0cba805b8
   5:     0x562b82627667 - <futures::future::map_err::MapErr<A, F> as futures::future::Future>::poll::hbeb65d90d09deaf4
   6:     0x562b826c5fbf - futures::task_impl::std::set::h0eda14c4de187820
   7:     0x562b826c344f - <scoped_tls::ScopedKey<T>>::set::h0efdea05668e906c
   8:     0x562b826b9b7f - tokio_core::reactor::Core::poll::hfaacf20236fbcc81
   9:     0x562b8249e643 - tokio_core::reactor::Core::run::hdb7e20c27c731447
  10:     0x562b824deda8 - std::sys_common::backtrace::__rust_begin_short_backtrace::h808bc1a152c1d7fa
  11:     0x562b82412455 - _ZN3std9panicking3try7do_call17h968da37b516bd463E.llvm.6120476831782816583
  12:     0x562b8272882e - __rust_maybe_catch_panic
                        at libpanic_unwind/lib.rs:102
  13:     0x562b824157fa - <F as alloc::boxed::FnBox<A>>::call_box::hb783e783e73f8fc2
  14:     0x562b8272076b - <alloc::boxed::Box<alloc::boxed::FnBox<A, Output$u3d$R$GT$$u20$$u2b$$u20$$u27$a$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h598f9713c9cb9093
                        at /checkout/src/liballoc/boxed.rs:798
                         - std::sys_common::thread::start_thread::h65eb43a1201d41e6
                        at libstd/sys_common/thread.rs:24
                         - std::sys::unix::thread::Thread::new::thread_start::h711c51a13a158afa
                        at libstd/sys/unix/thread.rs:90
  15:     0x7f0749c81063 - start_thread
  16:     0x7f07497a062c - clone
  17:                0x0 - <unknown>)))
ERR! conduit_proxy::control "controller-client", controller error: Error attempting to establish underlying session layer: operation timed out after 3s
WARN trust_dns_proto::dns_handle error notifying wait, possible future leak: Err(ResolveError(Proto(Timeout), (None, stack backtrace:
   0:     0x562b82671e0c - backtrace::backtrace::trace::hef23eb1bdedb9de4
   1:     0x562b826710a2 - backtrace::capture::Backtrace::new::h570f24a892cd6a34
   2:     0x562b8263b917 - <trust_dns_proto::error::ProtoError as core::convert::From<trust_dns_proto::error::ProtoErrorKind>>::from::h043748def247281b
   3:     0x562b8262fec2 - _ZN110_$LT$trust_dns_proto..dns_handle..DnsFuture$LT$S$C$$u20$E$C$$u20$MF$GT$$u20$as$u20$futures..future..Future$GT$4poll17h37890051da49b999E.llvm.17885337389023634268
   4:     0x562b825ffbd0 - <futures::future::chain::Chain<A, B, C>>::poll::h046c15f0cba805b8
   5:     0x562b82627667 - <futures::future::map_err::MapErr<A, F> as futures::future::Future>::poll::hbeb65d90d09deaf4
   6:     0x562b826c5fbf - futures::task_impl::std::set::h0eda14c4de187820
   7:     0x562b826c344f - <scoped_tls::ScopedKey<T>>::set::h0efdea05668e906c
   8:     0x562b826b9b7f - tokio_core::reactor::Core::poll::hfaacf20236fbcc81
   9:     0x562b8249e643 - tokio_core::reactor::Core::run::hdb7e20c27c731447
  10:     0x562b824deda8 - std::sys_common::backtrace::__rust_begin_short_backtrace::h808bc1a152c1d7fa
  11:     0x562b82412455 - _ZN3std9panicking3try7do_call17h968da37b516bd463E.llvm.6120476831782816583
  12:     0x562b8272882e - __rust_maybe_catch_panic
                        at libpanic_unwind/lib.rs:102
  13:     0x562b824157fa - <F as alloc::boxed::FnBox<A>>::call_box::hb783e783e73f8fc2
  14:     0x562b8272076b - <alloc::boxed::Box<alloc::boxed::FnBox<A, Output$u3d$R$GT$$u20$$u2b$$u20$$u27$a$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h598f9713c9cb9093
                        at /checkout/src/liballoc/boxed.rs:798
                         - std::sys_common::thread::start_thread::h65eb43a1201d41e6
                        at libstd/sys_common/thread.rs:24
                         - std::sys::unix::thread::Thread::new::thread_start::h711c51a13a158afa
                        at libstd/sys/unix/thread.rs:90
  15:     0x7f0749c81063 - start_thread
  16:     0x7f07497a062c - clone
  17:                0x0 - <unknown>)))
ERR! conduit_proxy turning Error caused by underlying HTTP/2 error: protocol error: unexpected internal error encountered into 500

@olix0r
Member

olix0r commented May 21, 2018

We just updated our trust-dns dependency at the end of last week, so we'll retest this and work with upstream to address the issue if it still exists.

@olix0r olix0r assigned hawkw and unassigned briansmith May 21, 2018
@jakerobb

jakerobb commented May 22, 2018

Maybe the rest of you already realize this, but I've found a workaround for this issue that works as long as the hosts you need to resolve have static IP addresses. In the pod template section of your deployment YAML, under the spec element, you can create a hostAliases element and define entries that will land in the container's /etc/hosts file, essentially circumventing DNS entirely. Here's mine:

spec:
  hostAliases:
  - ip: "10.0.7.14"
    hostnames:
    - "cassandra01"
  - ip: "10.0.7.15"
    hostnames:
    - "cassandra02"
  - ip: "10.0.7.27"
    hostnames:
    - "cassandra03"

Hope this helps someone!

@wmorgan
Member

wmorgan commented May 22, 2018

Great to know. Thanks @jakerobb

@briansmith
Contributor Author

cassandra01. 3600 IN A 10.0.7.14

The most likely reason it failed: Trust-DNS may not fall back to resolving the single-label unqualified name X as "X." after it tries everything else in the search path. I suggest we add this case to Trust-DNS's unit tests and see what happens.
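For reference, with a typical in-cluster resolv.conf like the one below (contents assumed for illustration, not taken from the reporter's pod), a single-label lookup should append each search suffix in turn and then, as a last resort, try the absolute single-label name:

```
# Typical kube-dns resolv.conf (contents assumed for illustration)
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# With ndots:5, a lookup of "cassandra01" should be tried as:
#   cassandra01.default.svc.cluster.local.
#   cassandra01.svc.cluster.local.
#   cassandra01.cluster.local.
#   cassandra01.            <- the final fallback Trust-DNS may be skipping
```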

@wmorgan
Member

wmorgan commented May 24, 2018

Maybe @bluejekyll knows?

@bluejekyll

The issue @briansmith is mentioning around search order and ndots was fixed in both 0.8.2 and 0.9 of the Resolver.

There is another issue of NameServer starvation that was only fixed in 0.9: https://github.com/bluejekyll/trust-dns/pull/457

Which version are you on currently?

@hawkw hawkw mentioned this issue Jun 1, 2018
@hawkw
Member

hawkw commented Jun 1, 2018

@bluejekyll this error was observed using 0.8.2 (I checked Cargo.lock and it looks like we were pulling in bluejekyll/trust-dns@ce6952c).

I'm planning on doing some testing and seeing if this also occurs with trust-dns-resolver 0.9+.

@hawkw
Member

hawkw commented Jun 1, 2018

Interesting: after updating trust-dns-resolver, I now see

ERR! admin={bg=resolver} conduit_proxy::control controller error: Error attempting to establish underlying session layer: operation timed out after 3s
WARN trust_dns_proto::xfer error notifying wait, possible future leak: Err(ResolveError { inner: ProtoError { inner:  request timed out })
ERR! admin={bg=resolver} conduit_proxy::control controller error: Error attempting to establish underlying session layer: operation timed out after 3s
WARN trust_dns_proto::xfer error notifying wait, possible future leak: Err(ResolveError { inner: ProtoError { inner: request timed out })

in the logs from a pod with

    spec:
      dnsPolicy: "Default"
      containers:
      - ...

curling an external name fails with HTTP 500, and it appears that after upgrading trust-dns-resolver, the proxy can no longer resolve cluster-local names with dnsPolicy: Default either.

dnsPolicy: ClusterFirst still works fine with the latest trust-dns-resolver.

@bluejekyll

bluejekyll commented Jun 1, 2018

error notifying wait, possible future leak usually means the handle to the lookup went away, though that may be a red herring given the timeout message.

I'm not familiar with this dnsPolicy: Default vs. ClusterFirst, what is that doing?

Edit: I just reviewed the docs,

  • "Default": The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.
  • "ClusterFirst": Any DNS query that does not match the configured cluster domain suffix, such as "www.kubernetes.io", is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured. See related discussion for details on how DNS queries are handled in those cases.
  • "ClusterFirstWithHostNet": For Pods running with hostNetwork, you should explicitly set its DNS policy to "ClusterFirstWithHostNet".
  • "None": A new option value introduced in Kubernetes v1.9 (Beta in v1.10). It allows a Pod to ignore DNS settings from the Kubernetes environment. All DNS settings are supposed to be provided using the dnsConfig field in the Pod Spec. See the DNS config subsection below.

Are these settings being passed directly into trust-dns-resolver config? meaning, is there some ordering required in the NameServers in the Resolver instance?

@briansmith
Contributor Author

We need to put #1032 into the next release and then retest this. I think #1032 will make our DNS stuff much more reasonable to think about, which is why it was prioritized.

@hawkw
Member

hawkw commented Jun 1, 2018

@briansmith the results above are from a build of Conduit with #1032.

I'm planning on doing some additional digging into this issue.

@hawkw
Member

hawkw commented Jun 1, 2018

Never mind; I just re-ran the tests with the master build of Conduit, and it looks like cluster-local names were always broken with custom dnsPolicy configurations, even prior to the trust-dns-resolver update. I must have failed to document that during the earlier testing.

@briansmith
Contributor Author

cluster-local names were always broken

To clarify, they weren't working yet for custom configurations that don't use the ClusterFirst policy, so this issue isn't resolved yet. They've been working fine in the default DNS configuration.

@hawkw
Member

hawkw commented Jun 1, 2018

@briansmith Yes, that's correct. Updated my original comment to make that clearer.

@hawkw
Member

hawkw commented Jun 1, 2018

@bluejekyll
Thanks for following up & looking into the K8s docs!

Are these settings being passed directly into trust-dns-resolver config? meaning, is there some ordering required in the NameServers in the Resolver instance?

This is, in fact, what I was going to start looking into next.

@bluejekyll

If constant ordering is required, we'll have to look into adding that as an option to the NameServerPool. Right now the nameservers are reordered based on performance: https://github.com/bluejekyll/trust-dns/blob/master/resolver/src/name_server_pool.rs#L624

If this is undesirable, we should disable this sort_unstable behind some config option. We'd also need to review the ordering and make sure the set of NameServers is consistent with the passed-in configuration. This should be easy now, since I previously switched this from a BinaryHeap to a plain Vec: https://github.com/bluejekyll/trust-dns/blob/master/resolver/src/name_server_pool.rs#L448-L449

@hawkw
Member

hawkw commented Jun 1, 2018

@bluejekyll I ran the test again using a custom build of trust-dns-resolver with the sort_unstable commented out, and I'm still seeing this error. Will keep digging.

@briansmith
Contributor Author

In the configuration above in #62 (comment) there's only one nameserver.

@hawkw hawkw removed their assignment Jul 30, 2018
@wmorgan
Member

wmorgan commented Feb 19, 2019

Is this issue still relevant?

@hawkw
Member

hawkw commented Feb 20, 2019

@wmorgan I'm not aware of anything that's happened recently that would fix it, but would have to test to confirm...

khappucino pushed a commit to Nordstrom/linkerd2 that referenced this issue Mar 5, 2019
trust-dns-resolver is a more complete implementation. In particular,
it supports CNAMES correctly, which is needed for PR linkerd#764. It also
supports /etc/hosts, which will help with issue linkerd#62.

Use the 0.8.2 pre-release since it hasn't been released yet. It was
created at our request.

Signed-off-by: Brian Smith <brian@briansmith.org>
@grampelberg
Contributor

We're no longer bypassing DNS, and StatefulSets now work properly.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021