Intermittent kube-dns errors after upgrading kops to v1.9 #5103

Closed
erstaples opened this issue May 3, 2018 · 3 comments

  1. What kops version are you running? The command kops version will display
    this information.

Version 1.9.0 (git-cccd71e67)

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/386"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

I followed the instructions here to update kube-dns from 1.14.9 to 1.14.10. Then I deployed several apps that use ExternalName services.
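
For illustration, the ExternalName services in question look roughly like the sketch below; the Service name and namespace are placeholders rather than our real manifests, and the externalName is the RDS endpoint the apps connect to:

apiVersion: v1
kind: Service
metadata:
  name: my-rds          # placeholder name
  namespace: default    # placeholder namespace
spec:
  type: ExternalName
  # kube-dns answers queries for my-rds.default.svc.cluster.local with a
  # CNAME pointing at this external hostname
  externalName: my-rds-instance.us-west-1.rds.amazonaws.com

The PHP pods then connect to my-rds (or my-rds.default.svc.cluster.local), and kube-dns is expected to follow the CNAME out to the RDS endpoint.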

  5. What happened after the commands executed?

I noticed intermittent errors in our PHP apps: PDO::__construct(): php_network_getaddresses: getaddrinfo failed: Name or service not known. Suspecting a kube-dns issue, I looked at the kube-dns logs and saw the following repeated over and over:

I0503 19:12:44.971393       1 logs.go:41] skydns: incomplete CNAME chain from "my-rds-instance.us-west-1.rds.amazonaws.com.": failure to lookup name
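
As a sanity check that the failures are at the DNS layer rather than in the app, a throwaway pod can run the same lookup through the cluster DNS. This is only a debugging sketch (the pod name is arbitrary; busybox:1.28 is used because its bundled nslookup behaves well), not part of our deployment:

apiVersion: v1
kind: Pod
metadata:
  name: dns-debug            # hypothetical debugging pod
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: dns-debug
    image: busybox:1.28
    # resolve the ExternalName target via the cluster DNS; intermittent
    # failures here point at kube-dns rather than the application
    command:
    - nslookup
    - my-rds-instance.us-west-1.rds.amazonaws.com

Re-creating this pod a few times and checking its logs should show whether resolution fails intermittently at the cluster DNS level, matching the getaddrinfo errors above.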

  6. What did you expect to happen?

I expected ExternalName DNS queries to resolve every time.

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -oyaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-04-24T16:19:39Z
  name: prod-cluster.example.io
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: <redacted>
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-1a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-1a
      name: a
    name: events
  fileAssets:
  - content: |
      yes
    name: docker-1.12
    path: /etc/coreos/docker-1.12
    roles:
    - Node
    - Master
  - content: |
      #cloud-config
      coreos:
        update:
          reboot-strategy: "etcd-lock"
        locksmith:
          window-start: Tue 09:00
          window-length: 2h
    name: cloud-config.yaml
    path: /home/core/cloud-config.yaml
    roles:
    - Node
    - Master
  hooks:
  - manifest: |
      <redacted>
    roles:
    - Node
    - Master
  - manifest: |
      Type=oneshot
      ExecStart=/usr/bin/coreos-cloudinit --from-file=/home/core/cloud-config.yaml
    name: reboot-window.service
    roles:
    - Node
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    authorizationRbacSuperUser: admin
    featureGates:
      TaintBasedEvictions: "true"
      TaintNodesByCondition: "true"
  kubeControllerManager:
    featureGates:
      TaintBasedEvictions: "true"
      TaintNodesByCondition: "true"
  kubeScheduler:
    featureGates:
      TaintBasedEvictions: "true"
      TaintNodesByCondition: "true"
  kubelet:
    featureGates:
      TaintBasedEvictions: "true"
      TaintNodesByCondition: "true"
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.9.3
  masterPublicName: api.<redacted>
  networkCIDR: <redacted> 
  networkID: <redacted>
  networking:
    canal: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - <redacted>
  subnets:
  - cidr: <redacted>
    name: us-west-1a
    type: Public
    zone: us-west-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-04-24T16:19:44Z
  labels:
    kops.k8s.io/cluster: <redacted> 
  name: <redacted>
spec:
  additionalSecurityGroups:
  - <redacted>
  image: ami-a9c2d0c9
  machineType: c5.large
  maxSize: 10
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: <redacted> 
  role: Node
  subnets:
  - us-west-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-04-24T16:19:45Z
  labels:
    kops.k8s.io/cluster: <redacted> 
  name: dedicated-memory
spec:
  additionalSecurityGroups:
  - <redacted>
  image: ami-a9c2d0c9
  machineType: m5.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: dedicated-memory
  role: Node
  subnets:
  - us-west-1a
  taints:
  - dedicated=memory:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-04-24T16:19:40Z
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-1a
spec:
  additionalSecurityGroups:
  - <redacted>
  image: ami-a9c2d0c9
  machineType: m5.2xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-1a
  role: Master
  subnets:
  - us-west-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-04-24T16:19:40Z
  labels:
    kops.k8s.io/cluster: <redacted> 
  name: nodes
spec:
  additionalSecurityGroups:
  - <redacted>
  image: ami-a9c2d0c9
  machineType: c5.2xlarge
  maxSize: 4
  minSize: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-west-1a

  8. Please run the commands with the most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

N/A. Not a kops CLI-related issue.

  9. Anything else we need to know?

Please let me know if this is not appropriate for this repo.

erstaples commented May 15, 2018

Reverting kube-dns to 1.14.5 fixed it for me. No more intermittent errors.
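
For anyone else landing here, the revert is just a matter of pointing the kube-dns Deployment in kube-system back at the 1.14.5 images. A rough sketch of a strategic-merge patch, assuming the standard upstream container names (check your own Deployment before applying):

# hypothetical patch file, e.g. kube-dns-revert.yaml
spec:
  template:
    spec:
      containers:
      - name: kubedns
        image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.5
      - name: dnsmasq
        image: k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.5
      - name: sidecar
        image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.5

The same three image tags can also be changed by hand with kubectl -n kube-system edit deployment kube-dns. Keep in mind that kops manages the kube-dns add-on, so a manual change like this may be reconciled away on a later kops update.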

mootpt commented Jun 13, 2018

I was also seeing high latency when attempting to resolve RDS hostnames in some cases. Worth noting that two of my three clusters were not exhibiting this behavior; all clusters are running kube-dns:1.14.9. @erstaples did you attempt upgrading to 1.14.9 once more to see if the behavior reappeared? This issue is perplexing. I'd say reopen this issue until we get official word on the culprit.

ruudk commented Aug 30, 2019

Seeing the same flood of errors with k8s.gcr.io/k8s-dns-kube-dns-amd64:1.15.4. Does anybody have an idea how to solve this?
