Skip to content
This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

[v0.16.x] Allow dnsmasq to be backed by a local copy of CoreDNS #1895

Merged
merged 2 commits into from
Sep 8, 2020

Conversation

kfr2
Copy link
Contributor

@kfr2 kfr2 commented Aug 4, 2020

This commit allows the user to specify that dnsmasq should be
backed by a pod-local copy of CoreDNS rather than relying on
the global CoreDNS service. If enabled, the dnsmasq-node
DaemonSet will be configured to use a local copy of CoreDN
for its resolution while setting the global CoreDNS service as
a fallback. This is handy in situations where the number of DNS
requests within a cluster grows large and causes resolution issues
as dnsmasq reaches out to the global CoreDNS service.

Additionally, several values passed to dnsmasq are now configurable
including its --cache-size and --dns-forward-max.

See this postmortem
for an investigation into this situation which was instrumental in
understanding issues we were facing. Many thanks to dominicgunn
for providing the manifests which I codified into this commit.


These features can be enabled and tuned by setting the following
values within cluster.yaml:

kubeDns:
  dnsmasq:
    coreDNSLocal:
      # When enabled, this will run a copy of CoreDNS within each DNS-masq pod and
      # configure the utility to use it for resolution.
      enabled: true

      # Defines the resource requests/limits for the coredns-local container.
      # cpu and/or memory constraints can be removed by setting the appropriate value(s)
      # to an empty string.
      resources:
        requests:
          cpu: 50m
          memory: 100Mi
        limits:
          cpu: 50m
          memory: 100Mi

    # The size of dnsmasq's cache.
    cacheSize: 50000

    # The maximum number of concurrent DNS queries.
    dnsForwardMax: 500

    # This option gives a default value for time-to-live (in seconds) which dnsmasq
    # uses to cache negative replies even in the absence of an SOA record.
    negTTL: 60

Related:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 4, 2020
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 4, 2020
@dominicgunn
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2020
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dominicgunn
You can assign the PR to them by writing /assign @dominicgunn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dominicgunn dominicgunn added this to the v0.16.3 milestone Aug 13, 2020
Copy link
Contributor

@dominicgunn dominicgunn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change required.

builtin/files/userdata/cloud-config-controller Outdated Show resolved Hide resolved
@kfr2 kfr2 force-pushed the coredns-local-v0.16.x branch 3 times, most recently from 68c58bb to 313e547 Compare August 28, 2020 13:56
This commit allows the user to specify that dnsmasq should be
backed by a pod-local copy of CoreDNS rather than relying on
the global CoreDNS service. If enabled, the dnsmasq-node
DaemonSet will be configured to use a local copy of CoreDNS
for its resolution while setting the global CoreDNS service as
a fallback. This is handy in situations where the number of DNS
requests within a cluster grows large and causes resolution issues
as dnsmasq reaches out to the global CoreDNS service.

Additionally, several values passed to dnsmasq are now configurable
including its `--cache-size` and `--dns-forward-max`.

See [this postmortem](https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md)
for an investigation into this situation which was instrumental in
understanding issues we were facing. Many thanks to dominicgunn
for providing the manifests which I codified into this commit.

---

These features can be enabled and tuned by setting the following
values within cluster.yaml:

```yaml
kubeDns:
  dnsmasq:
    coreDNSLocal:
      # When enabled, this will run a copy of CoreDNS within each DNS-masq pod and
      # configure the utility to use it for resolution.
      enabled: true

      # Defines the resource requests/limits for the coredns-local container.
      # cpu and/or memory constraints can be removed by setting the appropriate value(s)
      # to an empty string.
      resources:
        requests:
          cpu: 50m
          memory: 100Mi
        limits:
          cpu: 50m
          memory: 100Mi

    # The size of dnsmasq's cache.
    cacheSize: 50000

    # The maximum number of concurrent DNS queries.
    dnsForwardMax: 500

    # This option gives a default value for time-to-live (in seconds) which dnsmasq
    # uses to cache negative replies even in the absence of an SOA record.
    negTTL: 60
```
The dnsmasq-node ServiceAccount must exist whether or not CoreDNS-local
has been enabled. Therefore, it is created alongside the DaemonSet rather
than as part of the coredns-local manifest.

Additionally, always create dnsmasq-node-coredns-local.yaml If this file
does not exist (as would be the case if the CoreDNS local feature has
not been enabled), controller nodes will fail to come up with the error:
> error: the path "/srv/kubernetes/manifests/dnsmasq-node-coredns-local.yaml" does not exist
This is caused when `kubectl delete` is called against the file because
of the line `remove "${mfdir}/dnsmasq-node-coredns-local.yaml`.

This manifest must always be generated because the CoreDNS-local
feature cannot be enabled and then later disabled without otherwise
requiring manual operator intervention.
@dominicgunn dominicgunn merged commit ce5faab into kubernetes-retired:v0.16.x Sep 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants