Kops 1.12-alpha Calico upgrade #6636

Closed
sstarcher opened this issue Mar 18, 2019 · 8 comments

@sstarcher
Contributor

1. What kops version are you running? The command kops version will display this information.

Version 1.12.0-alpha.1 (git-511a44c67)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T06:59:37Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops update cluster --yes && kops rolling-update --yes

5. What happened after the commands executed?

Calico fails to come up after the upgrade (details in item 9 below)

6. What did you expect to happen?

Calico to be migrated to version 3

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  name: shane.dev.example.com
spec:
  additionalPolicies:
    master: |
      [{
        "Effect": "Allow",
        "Action": [
            "autoscaling:DescribeAutoScalingInstances"
        ],
        "Resource": [
            "*"
        ],
        "Condition": {
            "StringEquals": {
                "autoscaling:ResourceTag/KubernetesCluster": "shane.dev.example.com"
            }
        }
      }]
    node: |
      [{
          "Effect": "Allow",
          "Action": ["sts:AssumeRole"],
          "Resource": ["*"]
      },{
          "Effect": "Deny",
          "Action": ["sts:AssumeRole"],
          "Resource": ["arn:aws:iam:::role/Admin"]
      }]
  api:
    loadBalancer:
      idleTimeoutSeconds: 4000
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    disableSecurityGroupIngress: true
  cloudProvider: aws
  configBase: s3://example-terraform/dev/kops/shane.dev.example.com
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
    provider: Legacy
    version: 3.2.24
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
    provider: Legacy
    version: 3.2.24
  fileAssets:
  - content: |
      # https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/audit/audit-policy.yaml
      # https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L735
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        # The following requests were manually identified as high-volume and low-risk,
        # so drop them.
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
            - group: "" # core
              resources: ["endpoints", "services", "services/status"]
        - level: None
          # Ingress controller reads 'configmaps/ingress-uid' through the unsecured port.
          # TODO(#46983): Change this to the ingress controller service account.
          users: ["system:unsecured"]
          namespaces: ["kube-system"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["configmaps"]
        - level: None
          users: ["kubelet"] # legacy kubelet identity
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["nodes", "nodes/status"]
        - level: None
          userGroups: ["system:nodes"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["nodes", "nodes/status"]
        - level: None
          users:
            - system:kube-controller-manager
            - system:kube-scheduler
            - system:serviceaccount:kube-system:endpoint-controller
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
            - group: "" # core
              resources: ["endpoints"]
        - level: None
          users: ["system:apiserver"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["namespaces", "namespaces/status", "namespaces/finalize"]
        - level: None
          users: ["cluster-autoscaler"]
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
            - group: "" # core
              resources: ["configmaps", "endpoints"]
        # Don't log HPA fetching metrics.
        - level: None
          users:
            - system:kube-controller-manager
          verbs: ["get", "list"]
          resources:
            - group: "metrics.k8s.io"
        # Don't log these read-only URLs.
        - level: None
          nonResourceURLs:
            - /healthz*
            - /version
            - /swagger*
        # Don't log events requests.
        - level: None
          resources:
            - group: "" # core
              resources: ["events"]
        # node and pod status calls from nodes are high-volume and can be large, don't log responses for expected updates from nodes
        - level: Request
          users: ["kubelet", "system:node-problem-detector", "system:serviceaccount:kube-system:node-problem-detector"]
          verbs: ["update","patch"]
          resources:
            - group: "" # core
              resources: ["nodes/status", "pods/status"]
          omitStages:
            - "RequestReceived"
        - level: Request
          userGroups: ["system:nodes"]
          verbs: ["update","patch"]
          resources:
            - group: "" # core
              resources: ["nodes/status", "pods/status"]
          omitStages:
            - "RequestReceived"
        # deletecollection calls can be large, don't log responses for expected namespace deletions
        - level: Request
          users: ["system:serviceaccount:kube-system:namespace-controller"]
          verbs: ["deletecollection"]
          omitStages:
            - "RequestReceived"
        # Secrets, ConfigMaps, and TokenReviews can contain sensitive & binary data,
        # so only log at the Metadata level.
        - level: Metadata
          resources:
            - group: "" # core
              resources: ["secrets", "configmaps"]
            - group: authentication.k8s.io
              resources: ["tokenreviews"]
          omitStages:
            - "RequestReceived"
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: audit.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogPath: '-'
    auditPolicyFile: /srv/kubernetes/assets/audit.yaml
    oidcClientID: kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.dev.example.com
    oidcUsernameClaim: email
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.4
  masterInternalName: api.internal.shane.dev.example.com
  masterPublicName: api.shane.dev.example.com
  networkCIDR: 10.40.0.0/16
  networkID: vpc-4dd43e34
  networking:
    calico:
      majorVersion: v3
  nodePortAccess:
  - 10.40.0.0/16
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - cidr: 10.40.0.0/22
    id: subnet-0490c74c
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.40.4.0/22
    id: subnet-f910279f
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.40.8.0/22
    id: subnet-f608e4ac
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.40.128.0/22
    id: subnet-c29ec98a
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.40.132.0/22
    id: subnet-911027f7
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.40.136.0/22
    id: subnet-f708e4ad
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
  updatePolicy: external

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2b
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2b
  role: Master
  subnets:
  - us-west-2b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c
  role: Master
  subnets:
  - us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2a
  role: Node
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2b
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2b
  role: Node
  subnets:
  - us-west-2b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2c
  role: Node
  subnets:
  - us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-west-cpu-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/cpu: ""
    k8s.io/cluster-autoscaler/node-template/taint/cpu: ""
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: c5.4xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-west-cpu-2a
  role: Node
  subnets:
  - us-west-2a
  taints:
  - role=cpu:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-west-mem-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/mem: ""
    k8s.io/cluster-autoscaler/node-template/taint/mem: ""
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: r4.4xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-west-mem-2a
  role: Node
  subnets:
  - us-west-2a
  taints:
  - role=mem:NoSchedule

9. Anything else we need to know?

  • I built from master at this SHA: 511a44c.
  • I initially ran kops cluster upgrade without changing the Calico majorVersion to v3.
  • The Calico node controller and the new Calico pod failed due to the removal of etcd_endpoints from the ConfigMap (see the quick check sketched below).
  • I then modified the cluster manifest to change the Calico majorVersion to v3, ran kops cluster upgrade again, and killed the master node that was failing.
  • Afterwards I still got the same issue with the node controller and etcd_endpoints.
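
For reference, a minimal check sketch, assuming the default kops object names for Calico (calico-config, calico-node, calico-kube-controllers); adjust the names if your manifests differ:

# Is the v2-style etcd_endpoints key still present in the ConfigMap after the upgrade?
kubectl -n kube-system get configmap calico-config -o yaml | grep -i etcd_endpoints

# Do the Calico workloads still reference that key (e.g. via configMapKeyRef)?
kubectl -n kube-system get ds/calico-node deploy/calico-kube-controllers -o yaml | grep -in etcd_endpoints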
@sstarcher
Contributor Author

@justinsb can we block the next release on this, unless I did something wrong?

@justinsb
Member

This would block the release, but I don't think we should block the alpha; not everyone will build from source.

The Calico upgrade with etcd3 is disruptive, though: https://github.com/kubernetes/kops/blob/63943277bc48f1faa7cc773c1c7b2d8127c4f9b3/docs/etcd3-migration.md

Did a kops rolling-update cluster work? It's (sadly) not expected that a standard rolling-update will work (the two variants are sketched below).
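
For anyone following along, a hedged sketch of the two kinds of roll being contrasted here; the exact procedure for the etcd3/Calico migration is in the linked doc, and these flags all exist in kops rolling-update:

# Standard roll: replaces instances one at a time and waits for the cluster to come back healthy
kops rolling-update cluster --yes

# Forced, cloud-only roll: does not talk to the Kubernetes API or validate the cluster,
# which is what a migration that takes the control plane down temporarily tends to require
kops rolling-update cluster --cloudonly --force --yes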

@sstarcher
Contributor Author

Agreed, it should not block the alpha. Let me circle back and test this again.

@justinsb
Member

So we chatted on Slack, and the problematic configuration seems to be k8s 1.12 + kops 1.11 + etcd3 + Legacy provider + Calico. After an upgrade to kops 1.12 it looks like there are two problems:

  1. The Calico update is applied "on top", but this means we keep some references that don't exist in the new version, for example CALICO_ETCD_ENDPOINTS pointing to a ConfigMap key (it looks like).
  2. The etcd-manager import of the existing cluster doesn't like the https scheme: error initializing etcd server: scheme not yet implemented: "https://etcd-events-1.internal.calico.awsdata.com:2381" (a note on where this surfaces follows below).
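
A hedged note on where problem 2 surfaces: with etcd-manager the error comes from the etcd containers on the masters, so something like the following should reproduce it (the container name and log paths are the usual kops defaults, not taken from this issue):

# On an affected master:
sudo docker ps | grep etcd-manager
sudo grep -i "scheme not yet implemented" /var/log/etcd.log /var/log/etcd-events.log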

@zachaller
Contributor

We are running into number 2 as well, but we are not using Calico; we are using Weave, and are just upgrading from etcd v3 with TLS (Legacy provider) to etcd v3 (Manager).
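
For context, the Legacy to Manager switch being described maps onto the etcdClusters provider field visible in the manifest earlier in this issue; a hedged, illustrative sketch of the edit (same field names as that manifest):

# kops edit cluster, then for both etcd clusters (main and events) flip the provider:
#   etcdClusters:
#   - name: main
#     provider: Manager   # was: Legacy
#   - name: events
#     provider: Manager   # was: Legacy
kops edit cluster --name my.example.com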

@mikesplain
Contributor

We are as well; I'm going to do some additional testing today and tomorrow.

@sstarcher
Contributor Author

Fixed in 1.12.0-alpha.3 by #6682.
It does require a rolling update of the masters all at once.

@justinsb
Member

Sorry for not updating this: #6682 should have fixed part 1 (the Calico manifest was broken), and #6695 should have fixed part 2 (moving TLS etcd to etcd-manager).
