Unable to revert deployment to previous version #43948
Comments
Is this from local testing? Can you post the full log from the controller manager? Can you also post the output of: kubectl get deploy,rs,pods -l kube-dns_label_here -n kube-system -o yaml
It seems that the controller manager was fast enough to process the kube-dns deployment in less than 0.2 seconds, before the cache got updated with the replica set. This can happen when your queues are empty. We need to verify that rate-limiting actually works and maybe revisit its settings. We probably also want to bump the retries to something higher; the deployment controller is the only one of the workload controllers that drops objects out of its queue after some retry count.
@kubernetes/sig-apps-bugs @deads2k
Rate-limiting works fine; we need to add more retries.
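For readers following along, here is a minimal sketch of the retry pattern being discussed, built on client-go's rate-limited workqueue. This is not the actual deployment controller source; the function name handleSyncError is illustrative, and maxRetries uses the higher budget suggested later in this thread.

```go
package main

import (
	"errors"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 15 // the higher retry budget suggested later in this thread

// handleSyncError requeues failed keys with backoff until a retry budget
// is exhausted, after which the key is dropped from the queue entirely.
func handleSyncError(queue workqueue.RateLimitingInterface, key string, err error) {
	if err == nil {
		queue.Forget(key) // success: reset the per-key retry counter
		return
	}
	if queue.NumRequeues(key) < maxRetries {
		queue.AddRateLimited(key) // requeue with exponential backoff
		return
	}
	// Budget exhausted: the key is dropped. If the failure was a transient
	// cache race (as suspected above), the deployment then stays stuck
	// until something else triggers a resync.
	queue.Forget(key)
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	handleSyncError(q, "kube-system/kube-dns", errors.New("replica set not yet in cache"))
	fmt.Println("requeues so far:", q.NumRequeues("kube-system/kube-dns"))
}
```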
Thanks for looking into it - I was able to reproduce. My suspicion is that it is not a race, but rather that it is caused by newly introduced fields across versions. (I inserted some sections for sanity.)
The error was
I am 95% sure (sadly kubectl edit doesn't appear in scrollback) that the problem is that the optional: true field had been dropped. I then re-added it. My hypothesis (though I don't understand the code) is that the hash (1321724180) is based on the configuration as applied, which might not accurately reflect the values actually set, either after a cluster downgrade & upgrade or when using an older version of kubectl/k8s originally (I am unsure exactly). So when I then set the manifest to re-add the missing field, it collided with the existing hashed value. Is that plausible?
The hash is based on the Deployment PodTemplateSpec. If the Deployment controller can't find a ReplicaSet that has a semantically deep-equal PodTemplateSpec, it will create a new ReplicaSet by hashing the Deployment template (for the new RS name). There is a slight chance you will hit a hash collision with the current algo, but that seems to be a problem only when there are hundreds of old ReplicaSets: #29735
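A minimal sketch of the flow just described, using a toy template type (the real controller hashes the full v1.PodTemplateSpec; all names here are illustrative): the controller first looks for an existing ReplicaSet whose template is semantically deep-equal, and only if none matches does it hash the template to name a new one. Note how dropping a single field changes the hash, which is consistent with the downgrade hypothesis above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"reflect"
)

// Toy stand-in for v1.PodTemplateSpec; the real type is much larger.
type PodTemplate struct {
	Image    string
	Optional string // "" models the field being dropped by an older API version
}

type ReplicaSet struct {
	Name     string
	Template PodTemplate
}

// hashTemplate mimics deriving a new ReplicaSet name from template contents.
func hashTemplate(t PodTemplate) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", t) // deterministic for this value-only struct
	return h.Sum32()
}

// findOrCreate returns an existing RS with a deep-equal template, or a new one.
func findOrCreate(existing []ReplicaSet, deploy string, t PodTemplate) ReplicaSet {
	for _, rs := range existing {
		if reflect.DeepEqual(rs.Template, t) {
			return rs // adopt the semantically identical ReplicaSet
		}
	}
	return ReplicaSet{Name: fmt.Sprintf("%s-%d", deploy, hashTemplate(t)), Template: t}
}

func main() {
	full := PodTemplate{Image: "kube-dns:1.14", Optional: "true"}
	dropped := PodTemplate{Image: "kube-dns:1.14"} // optional field lost on downgrade

	rs1 := findOrCreate(nil, "kube-dns", full)
	rs2 := findOrCreate([]ReplicaSet{rs1}, "kube-dns", dropped)
	fmt.Println(rs1.Name, rs2.Name) // different names: the dropped field changed the hash
}
```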
You re-added the missing optional: true field. Can you try to patch the Deployment controller to use more retries (15 is a good number) and retry the upgrade-downgrade?
Ok, this seems like a collision: https://www.diffchecker.com/E4CxPOdr |
I want a concrete timeline here:
@kubernetes/sig-apps-bugs |
Automatic merge from submit-queue

[1.5] Update deployment retries to a saner count

Safe-guard for failures like #43948
Automatic merge from submit-queue

Switch Deployments to new hashing algo w/ collision avoidance mechanism

Implements kubernetes/community#477

@kubernetes/sig-apps-api-reviews @kubernetes/sig-apps-pr-reviews

Fixes #29735
Fixes #43948

```release-note
Deployments are updated to use (1) a more stable hashing algorithm (fnv) than the previous one (adler) and (2) a hashing collision avoidance mechanism that will ensure new rollouts will not block on hashing collisions anymore.
```
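The collision avoidance mechanism can be sketched as follows (illustrative only, not the exact kubernetes/kubernetes code): a per-Deployment collisionCount is mixed into the hash input, so when a freshly computed name collides with an existing, non-equivalent ReplicaSet, the controller bumps the counter and gets a different name on the next sync instead of retrying the same doomed create forever.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// computeHash mixes an optional collisionCount into the FNV hash of the
// template, loosely mirroring the mechanism added by the PR above.
func computeHash(template string, collisionCount *int32) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s", template)
	if collisionCount != nil {
		fmt.Fprintf(h, "%d", *collisionCount) // extra entropy on collision
	}
	return h.Sum32()
}

func main() {
	tpl := "kube-dns pod template"
	var count int32

	base := computeHash(tpl, &count)
	count++ // pretend the first name collided with an old ReplicaSet
	retry := computeHash(tpl, &count)

	// The bumped counter yields a new name, unblocking the rollout.
	fmt.Printf("kube-dns-%d -> kube-dns-%d\n", base, retry)
}
```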
K8s 1.6.0: When I try setting a deployment back to a version that already existed (i.e. A -> B -> A), I get the following error logged in the kube-controller-manager (k-c-m):
(In my case, I was changing the kube-dns config map from optional: true -> optional: false -> optional: true, around a 1.5 -> 1.6 -> 1.5 upgrade / downgrade)
This is particularly problematic because the new deployment is then not configured - the pod retains its old configuration.
I was able to reproduce this repeatedly, but then I deleted the replicaset and the system was then able to recover and I could no longer reproduce it.