
Update CoreDNS to v1.12 to fix OOM & restart #1037

Closed
liheyuan opened this issue Aug 7, 2018 · 42 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

@liheyuan

liheyuan commented Aug 7, 2018

BUG REPORT

Versions

kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:50:16Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release): Ubuntu 16.04 LTS X64

  • Kernel (e.g. uname -a): 4.4.0-91-generic #114-Ubuntu SMP

  • Others:

What happened?

CoreDNS keeps getting OOM-killed and restarting; other pods work fine

get pod status

NAMESPACE NAME READY STATUS RESTARTS AGE
....
kube-system coredns-78fcdf6894-ls2q4 0/1 CrashLoopBackOff 12 1h
kube-system coredns-78fcdf6894-xn75c 0/1 CrashLoopBackOff 12 1h
....

describe the pod

Name: coredns-78fcdf6894-ls2q4
Namespace: kube-system
Priority: 0
PriorityClassName:
Node: k8s1/172.21.0.8
Start Time: Tue, 07 Aug 2018 11:59:37 +0800
Labels: k8s-app=kube-dns
pod-template-hash=3497892450
Annotations: cni.projectcalico.org/podIP=192.168.0.7/32
Status: Running
IP: 192.168.0.7
Controlled By: ReplicaSet/coredns-78fcdf6894
Containers:
coredns:
Container ID: docker://519046f837c93439a77d75288e6d630cdbcefe875b0bdb6aa5409d566070ec03
Image: k8s.gcr.io/coredns:1.1.3
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:db2bf53126ed1c761d5a41f24a1b82a461c85f736ff6e90542e9522be4757848
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 07 Aug 2018 13:07:21 +0800
Finished: Tue, 07 Aug 2018 13:08:21 +0800
Ready: False
Restart Count: 12
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-tsv2g (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-tsv2g:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-tsv2g
Optional: false
QoS Class: Burstable
Node-Selectors:
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message


Warning Unhealthy 44m kubelet, k8s1 Liveness probe failed: Get http://192.168.0.7:8080/health: dial tcp 192.168.0.7:8080: connect: connection refused
Normal Pulled 41m (x5 over 1h) kubelet, k8s1 Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
Normal Created 41m (x5 over 1h) kubelet, k8s1 Created container
Normal Started 41m (x5 over 1h) kubelet, k8s1 Started container
Warning Unhealthy 40m kubelet, k8s1 Liveness probe failed: Get http://192.168.0.7:8080/health: read tcp 172.21.0.8:40972->192.168.0.7:8080: read: connection reset by peer
Warning Unhealthy 34m (x2 over 38m) kubelet, k8s1 Liveness probe failed: Get http://192.168.0.7:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 4m (x124 over 44m) kubelet, k8s1 Back-off restarting failed container

logs of the pod

.:53
CoreDNS-1.1.3
linux/amd64, go1.10.1, b0fd575c
2018/08/07 05:13:27 [INFO] CoreDNS-1.1.3
2018/08/07 05:13:27 [INFO] linux/amd64, go1.10.1, b0fd575c
2018/08/07 05:13:27 [INFO] plugin/reload: Running configuration MD5 = 2a066f12ec80aeb2b92740dd74c17138

RAM usage of the master

              total        used        free      shared  buff/cache   available
Mem:           1872         711         365           8         795         960
Swap:             0           0           0

RAM usage of the slave

              total        used        free      shared  buff/cache   available
Mem:           1872         392          78          17        1400        1250
Swap:             0           0           0

What you expected to happen?

CoreDNS keeps working and does not restart

How to reproduce it (as minimally and precisely as possible)?

kubeadm init --apiserver-advertise-address=10.4.96.3 --pod-network-cidr=192.168.0.0/16
use Calico as the network plugin

join the second (slave) machine

node status is Ready for both:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s1 Ready master 1h v1.11.1
k8s2 Ready 1h v1.11.1

Anything else we need to know?

I'm testing on hosts with 2 GB RAM; not sure if that is too small for k8s.

@chrisohaver

There is a known issue on Ubuntu, where kubeadm sets up CoreDNS (and also kube-dns) incorrectly.
If Ubuntu is using systemd-resolved, as it does by default in recent versions, then its /etc/resolv.conf contains a localhost address (127.0.0.53). Kubernetes pushes this configuration to all pods with the "Default" DNS policy, so when they forward lookups upstream, the query comes right back at them, looping until OOM.

The fixes are to update kubelet to use the correct resolv.conf (the one that systemd-resolved actually maintains), or to configure the upstream servers directly in your CoreDNS ConfigMap (but that doesn't fix the issue for other pods with the "Default" DNS policy), or to disable systemd-resolved on the nodes.
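
For the ConfigMap option, a minimal sketch (kubectl -n kube-system edit configmap coredns), assuming the kubeadm default Corefile; the upstream addresses below are placeholders:

   # before (kubeadm default): forward lookups via the node's /etc/resolv.conf
   proxy . /etc/resolv.conf
   # after: explicit upstream resolvers, bypassing the node's stub resolver
   proxy . 8.8.8.8 8.8.4.4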

@chrisohaver

FYI, the kubelet flag is --resolv-conf=<path>
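
On a systemd-resolved host that would typically point at /run/systemd/resolve/resolv.conf, which holds the real upstream servers, while /etc/resolv.conf only holds the 127.0.0.53 stub. A sketch:

   # e.g. in /etc/default/kubelet (deb) or /etc/sysconfig/kubelet (rpm), then restart kubelet
   KUBELET_EXTRA_ARGS=--resolv-conf=/run/systemd/resolve/resolv.conf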

@chrisohaver

There is a known issue on Ubuntu, where kubeadm sets up CoreDNS (and also kube-dns) incorrectly.

To be more correct, it's kubelet that is set up incorrectly, not coredns/kube-dns directly.

The next version of CoreDNS will be able to detect this misconfiguration and put warnings/errors in the logs. But that's not a fix. It just makes the failure less mysterious.

Not sure if it's up to kubeadm to detect the use of systemd-resolved and adjust the kubelet config accordingly during kubeadm init automatically. Perhaps in a preflight check it could look for local addresses in /etc/resolv.conf, or for systemd-resolved running, and then warn the user.
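
A rough sketch of what such a check could look like (hypothetical, not actual kubeadm code):

   # warn if the node's resolv.conf points at a local stub resolver
   if grep -qE '^[[:space:]]*nameserver[[:space:]]+127\.' /etc/resolv.conf || systemctl is-active --quiet systemd-resolved; then
       echo "WARNING: node DNS goes through a local stub; consider kubelet --resolv-conf=/run/systemd/resolve/resolv.conf"
   fi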

@neolit123
Member

neolit123 commented Aug 7, 2018

hi @liheyuan ,

  1. When you run kubeadm init, kubeadm should generate a file /var/lib/kubelet/kubeadm-flags.env that handles the systemd-resolved issue automatically for you:
    https://kubernetes.io/docs/setup/independent/kubelet-integration/

What are the contents of that file after you run kubeadm init, and is your OS using systemd-resolved? (See the example after this list.)

  2. Have you tried a different CNI, e.g. Weave Net or Flannel?
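
For reference, on a host where kubeadm detects systemd-resolved, the generated file usually looks something like this (a sketch; the exact flags vary by version and setup):

   # /var/lib/kubelet/kubeadm-flags.env
   KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --network-plugin=cni --resolv-conf=/run/systemd/resolve/resolv.conf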

@chrisohaver

Ah... I didn't know about kubernetes/kubernetes#64665. Good to know!

@liheyuan
Author

liheyuan commented Aug 8, 2018

@chrisohaver Thanks for your reply, I'll give it a try.
@neolit123 Thank you. Flannel doesn't work in our scenario; maybe I'll try Weave Net in the future.

Also, @chrisohaver @neolit123, I tried modifying the CoreDNS deployment, increasing the memory limit from 170MB (the default) to 256MB, and it works like a charm... Maybe this is another solution.

@neolit123
Member

@liheyuan

170MB (the default) to 256MB, and it works like a charm... Maybe this is another solution.

thanks for finding that.

@chrisohaver
do you have an idea why the memory cap causes a problem?
I think it's OK to keep the issue here in case you'd suggest we bump the memory cap to 256MB in the kubeadm manifest.

@chrisohaver

do you have an idea why the memory cap causes a problem?

No - in fact, the CoreDNS manifests don't have a memory cap defined by default, so I don't know where the cap was introduced. Possibly in this cluster, kube-system has a default container memory limit? Though I don't think that's a default setting either.

@chrisohaver

@liheyuan thanks for noticing the low memory cap.

By any chance, did you add the initial 170 memory limit to the coredns deployment, or perhaps add a container memory limit to the kube-system namespace? Trying to understand how the limit was introduced in your case.

@liheyuan
Author

@chrisohaver I'm not sure; I found the 170MB limit when I exported the coredns YAML using kubectl.

@asipser

asipser commented Aug 13, 2018

I'm also using kubeadm to launch a local kube cluster and am running into the same issue. I also have the 170Mi cap in the yaml for the coredns deployment. I can't seem to get it working, unlike @liheyuan. After I run kubeadm init I see nothing related to systemd-resolved, @neolit123; am I doing anything wrong? I have the most recent version of kubeadm.

@chrisohaver

@asipser what did you set the memory cap to?

@chrisohaver

@neolit123, Would kubeadm set up memory caps in a cluster by default? E.g. in the kube-system namespace, or directly in the coredns deployment?

@chrisohaver

@neolit123, sorry, it was just brought to my attention that there is a hard-coded memory limit (that is too small) in the deployment in the kubernetes repo. It's not in the coredns repo's deployment, which is where I looked earlier. I'm not 100% clear on the reasoning for adding it to the kubernetes repo's copy; I believe it was copied from the kube-dns settings. We're updating that now...

@asipser

asipser commented Aug 14, 2018

I got it working by starting up the systemd-resolved service, which updated my /etc/resolv.conf properly. Even with the 170Mi cap I could get coredns working. Thanks anyway @chrisohaver.

@chrisohaver

chrisohaver commented Aug 14, 2018

@asipser Glad it's working for you. Take care that systemd-resolved hasn't put the local address 127.0.0.53 in /etc/resolv.conf ... that will cause problems for upstream lookups.
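
A quick way to check on the node (it should list real upstream servers, not 127.0.0.53):

   grep nameserver /etc/resolv.conf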

@neolit123
Member

@chrisohaver sorry for not looking earlier.
I see the PR is up already.

@chrisohaver

@liheyuan, I'm trying to understand the root cause of this issue better. If you don't mind sharing, do you happen to know what DNS QPS rates your cluster is exhibiting? Under high load, coredns can use more memory.

@liheyuan
Author

@chrisohaver Sorry for my late reply.

I'm setting up the k8s cluster as a test env, so the DNS QPS is very low, around ~2/sec.

@timothysc timothysc self-assigned this Aug 27, 2018
@timothysc timothysc modified the milestone: v1.12 Aug 27, 2018
@timothysc
Member

Hey folks, do we have a canonical repro setup? I'm seeing a lot of anecdotal detail, but not a 100% consistent reproducer...

@neolit123
Member

Fixed in the latest CoreDNS, as outlined in:
kubernetes/kubernetes#67392 (comment)

@chrisohaver

@liheyuan how often is CoreDNS OOM-restarting? If we assume the root cause was the recently fixed cache issue: at your cluster's 2 QPS (as you say above), it would take at minimum about 24 hours for the cache to exhaust... and even then, only if every query made is unique (~230,000 unique DNS names), which is extremely unusual.

@timothysc timothysc reopened this Aug 29, 2018
@timothysc
Member

I'm reopening, as we need a PR to update the CoreDNS image version to 1.2.2 and a PR to update the image in gcr.io.

@timothysc timothysc added kind/bug Categorizes issue or PR as related to a bug. and removed priority/needs-more-evidence labels Aug 29, 2018
@rajansandeep

Yes, @timothysc I will be pushing the PR once the CoreDNS image is available in gcr.io

@timothysc timothysc added this to the v1.12 milestone Aug 29, 2018
@timothysc
Member

xref - kubernetes/kubernetes#68020

@timothysc timothysc changed the title k8s 1.11.1 coredns keep OOM & restart Update CoreDNS to v1.12 to fix OOM & restart Aug 29, 2018
@rajansandeep

@timothysc you mean update CoreDNS to v1.2.2?

@timothysc timothysc added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Aug 29, 2018
@chrisohaver

@timothysc, this issue is in a test environment with 2 QPS. I really don't think it's related to the cache issue fixed in CoreDNS 1.2.2 at all (which requires high QPS to manifest).

This could instead be a case of kubernetes/kubernetes#64665 failing to detect systemd-resolved and adjust the kubelet flags... or perhaps systemd-resolved failed and left the system in a bad state (e.g. /etc/resolv.conf still contains a local address, but systemd-resolved isn't running).

kubernetes/kubernetes#64665 checks to see if systemd-resolved is running; if it isn't, it assumes /etc/resolv.conf is OK. However, I see a comment on Stack Exchange (albeit old) about how to disable systemd-resolved which suggests that simply disabling the service leaves /etc/resolv.conf in a bad state.

@liheyuan
Author

@chrisohaver Just as in the original report, it keeps restarting: it crashes and restarts, and when I use DNS to ping a cluster service, it crashes again.

No DNS query, no crash.

After a query, it crashes.

@chrisohaver

No DNS query, no crash.
After a query, it crashes.

@liheyuan, this behavior lines up with infinite recursion caused by a local address present in /etc/resolv.conf in the coredns pod. A single query can cause coredns to forward the query to itself indefinitely, resulting in OOM.

Please check the following...

  1. What are the contents of /etc/resolv.conf on the host node of coredns?
  2. What are the contents of /var/lib/kubelet/kubeadm-flags.env?
  3. Are you running systemd-resolved on any of the nodes in your cluster?
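
The corresponding host-side commands would be roughly (run on the node where coredns is scheduled):

   cat /etc/resolv.conf                      # 1. should not contain a 127.x.x.x nameserver
   cat /var/lib/kubelet/kubeadm-flags.env    # 2. look for a --resolv-conf flag
   systemctl is-active systemd-resolved      # 3. is systemd-resolved running?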

@dims
Member

dims commented Aug 31, 2018

@timothysc fyi CoreDNS version 1.2.2 is now available on gcr.io

@liheyuan
Author

liheyuan commented Sep 1, 2018

@chrisohaver

  1. What are the contents of /etc/resolv.conf on the host node of coredns?

nameserver 183.60.83.19
nameserver 183.60.82.98

  2. What are the contents of /var/lib/kubelet/kubeadm-flags.env?

KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --network-plugin=cni

  3. Are you running systemd-resolved on any of the nodes in your cluster?

nope

I have also checked the DNS Pod's resolv.conf; it's also

nameserver 183.60.83.19
nameserver 183.60.82.98

BTW,
I am investigating the other app Pods, suspecting that a bad Pod has caused a DNS infinite loop.

@chrisohaver

@liheyuan, generally loops are caused by the forwarding path, and your /etc/resolv.conf shows that there are no self-loops there. It is possible that your upstreams are configured to forward back to CoreDNS, although this would be very unlikely (because there would be no practical reason for doing so).

The other possibility is if your CoreDNS ConfigMap is configured to forward to itself. But this is also not likely, because it's not the default configuration.

If you care to troubleshoot further, you can enable logging in coredns by adding log to your coredns config. This will log every query coredns receives. If a forwarding loop is the culprit, it will be evident in the logs (you'd see the same query repeated ad infinitum in rapid succession). It may also reveal other unusual behavior, for example if there is a delinquent pod spamming the DNS server.

The latest image of CoreDNS (1.2.1) also has a loop detection plugin, which you can enable by adding loop to the coredns config:

.:53 {
   errors
   log
   loop

   [...]

}

@xlgao-zju

@neolit123 Can we set the CoreDNS version (other than by modifying the hard-coded CoreDNS version) when we run kubeadm init? For now, the default CoreDNS version in kubeadm v1.11.2 is 1.1.3. I want to use CoreDNS 1.2.2.

@neolit123
Member

@xlgao-zju as outlined here, we have a bit of an issue with allowing only a custom coredns image/version:
#1091 (comment)
which means that we also need to allow custom addon configs (in this case a Corefile).

@timothysc
Member

We're going to close this issue, but folks can rally on config overrides on a different issue.

@chrisohaver

@liheyuan, is your issue resolved?

@swathichittajallu

swathichittajallu commented Sep 6, 2018

@chrisohaver Hi. I'm facing the same issue with coredns (looping restarts), and I see the issue is due to the memory limit of 170Mi. Can you suggest how I can update my coredns deployment to 1.2.2, or how to increase the memory limit of the coredns deployment? I am using k8s version 1.11.2.

@chrisohaver

chrisohaver commented Sep 6, 2018

@swathichittajallu

I see the issue is due to the memory limit of 170Mi

Sometimes this is the reason, but not always. Continuous Pod restarts can be caused by any error that causes a container in a Pod to exit (e.g. by crash, or by fatal error, or by being killed by another process).

You can edit the coredns Deployment... kubectl -n kube-system edit deployment coredns. In that yaml definition, you can either change the memory limit "170Mi" to a higher number, or you can change the image version to 1.2.2.

...
      - name: coredns
        image: k8s.gcr.io/coredns:1.2.2
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 170Mi
          requests:
...

@swathichittajallu

@chrisohaver Thanks a lot :) It worked. The coredns pods are consuming close to 478Mi, so it worked with a memory limit of 512Mi.

@bw2

bw2 commented Oct 4, 2018

Another way to update the coredns version and raise the memory limit:

kubectl patch deployment -n=kube-system coredns -p '{"spec": {"template": {"spec":{"containers":[{"image":"k8s.gcr.io/coredns:1.2.2", "name":"coredns","resources":{"limits":{"memory":"1Gi"},"requests":{"cpu":"100m","memory":"70Mi"}}}]}}}}' 

@schwankner

I have kubeadm v1.12.0 on Debian 9 and solved this issue by switching from Calico to Weave.

@iamthecloudguy

iamthecloudguy commented Oct 31, 2018

I was facing the same issue for the last 2 days, and I created 6 VMs to resolve it. :)
I tried Ubuntu 16.04.1 & 18.04 with Kubernetes v1.12.2.
I noticed that with the combination of v1.12.2 + Ubuntu 18.04 we don't need to update /var/lib/kubelet/kubeadm-flags.env; that file already contains the --resolv-conf= flag. In my case the issue was with the CIDR value: I was using the flannel network plugin, whose default manifest only supports the pod CIDR 10.244.0.0/16, and I was trying something like 20.0.0.0/16, which is not supported out of the box (see the note after this paragraph).
Now there is no issue on either Ubuntu 16.04.1 or 18.04. I have not changed any DNS settings; everything started working automatically. If your DNS pods are not working, check your network plugin settings instead of the CoreDNS settings.
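
For reference, a non-default pod CIDR can still be used with flannel by editing net-conf.json in kube-flannel.yml so that it matches the --pod-network-cidr passed to kubeadm init (a sketch of the relevant ConfigMap fragment; the shipped default is shown):

  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }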

I am posting the complete command list to create a kubeadm cluster; just follow this:


curl -sL https://gist.githubusercontent.com/alexellis/7315e75635623667c32199368aa11e95/raw/b025dfb91b43ea9309ce6ed67e24790ba65d7b67/kube.sh | sudo sh

sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=10.1.1.5 --kubernetes-version stable (you must replace --apiserver-advertise-address with the IP of your master host)

sudo useradd kubeadmin -G sudo -m -s /bin/bash

sudo passwd kubeadmin

sudo su kubeadmin

cd $HOME

sudo cp /etc/kubernetes/admin.conf $HOME/

sudo chown $(id -u):$(id -g) $HOME/admin.conf

export KUBECONFIG=$HOME/admin.conf

echo "export KUBECONFIG=$HOME/admin.conf" | tee -a ~/.bashrc

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

kubectl taint nodes --all node-role.kubernetes.io/master-

kubectl get all --namespace=kube-system

Please try the above commands to create your cluster, and let me know if this works for you.
