Kops 1.12-alpha Calico upgrade #6636

Closed
sstarcher opened this issue Mar 18, 2019 · 8 comments

@sstarcher
Contributor

1. What kops version are you running? The command kops version will display this information.

Version 1.12.0-alpha.1 (git-511a44c67)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T06:59:37Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops update cluster --yes && kops rolling-update --yes

5. What happened after the commands executed?

Calico fails to come up after the upgrade (details in item 9 below)

6. What did you expect to happen?

Calico to be migrated to version 3

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  name: shane.dev.example.com
spec:
  additionalPolicies:
    master: |
      [{
        "Effect": "Allow",
        "Action": [
            "autoscaling:DescribeAutoScalingInstances"
        ],
        "Resource": [
            "*"
        ],
        "Condition": {
            "StringEquals": {
                "autoscaling:ResourceTag/KubernetesCluster": "shane.dev.example.com"
            }
        }
      }]
    node: |
      [{
          "Effect": "Allow",
          "Action": ["sts:AssumeRole"],
          "Resource": ["*"]
      },{
          "Effect": "Deny",
          "Action": ["sts:AssumeRole"],
          "Resource": ["arn:aws:iam:::role/Admin"]
      }]
  api:
    loadBalancer:
      idleTimeoutSeconds: 4000
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    disableSecurityGroupIngress: true
  cloudProvider: aws
  configBase: s3://example-terraform/dev/kops/shane.dev.example.com
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
    provider: Legacy
    version: 3.2.24
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
    provider: Legacy
    version: 3.2.24
  fileAssets:
  - content: |
      # https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/audit/audit-policy.yaml
      # https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L735
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        # The following requests were manually identified as high-volume and low-risk,
        # so drop them.
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
            - group: "" # core
              resources: ["endpoints", "services", "services/status"]
        - level: None
          # Ingress controller reads 'configmaps/ingress-uid' through the unsecured port.
          # TODO(#46983): Change this to the ingress controller service account.
          users: ["system:unsecured"]
          namespaces: ["kube-system"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["configmaps"]
        - level: None
          users: ["kubelet"] # legacy kubelet identity
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["nodes", "nodes/status"]
        - level: None
          userGroups: ["system:nodes"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["nodes", "nodes/status"]
        - level: None
          users:
            - system:kube-controller-manager
            - system:kube-scheduler
            - system:serviceaccount:kube-system:endpoint-controller
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
            - group: "" # core
              resources: ["endpoints"]
        - level: None
          users: ["system:apiserver"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["namespaces", "namespaces/status", "namespaces/finalize"]
        - level: None
          users: ["cluster-autoscaler"]
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
            - group: "" # core
              resources: ["configmaps", "endpoints"]
        # Don't log HPA fetching metrics.
        - level: None
          users:
            - system:kube-controller-manager
          verbs: ["get", "list"]
          resources:
            - group: "metrics.k8s.io"
        # Don't log these read-only URLs.
        - level: None
          nonResourceURLs:
            - /healthz*
            - /version
            - /swagger*
        # Don't log events requests.
        - level: None
          resources:
            - group: "" # core
              resources: ["events"]
        # node and pod status calls from nodes are high-volume and can be large, don't log responses for expected updates from nodes
        - level: Request
          users: ["kubelet", "system:node-problem-detector", "system:serviceaccount:kube-system:node-problem-detector"]
          verbs: ["update","patch"]
          resources:
            - group: "" # core
              resources: ["nodes/status", "pods/status"]
          omitStages:
            - "RequestReceived"
        - level: Request
          userGroups: ["system:nodes"]
          verbs: ["update","patch"]
          resources:
            - group: "" # core
              resources: ["nodes/status", "pods/status"]
          omitStages:
            - "RequestReceived"
        # deletecollection calls can be large, don't log responses for expected namespace deletions
        - level: Request
          users: ["system:serviceaccount:kube-system:namespace-controller"]
          verbs: ["deletecollection"]
          omitStages:
            - "RequestReceived"
        # Secrets, ConfigMaps, and TokenReviews can contain sensitive & binary data,
        # so only log at the Metadata level.
        - level: Metadata
          resources:
            - group: "" # core
              resources: ["secrets", "configmaps"]
            - group: authentication.k8s.io
              resources: ["tokenreviews"]
          omitStages:
            - "RequestReceived"
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: audit.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogPath: '-'
    auditPolicyFile: /srv/kubernetes/assets/audit.yaml
    oidcClientID: kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.dev.example.com
    oidcUsernameClaim: email
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.4
  masterInternalName: api.internal.shane.dev.example.com
  masterPublicName: api.shane.dev.example.com
  networkCIDR: 10.40.0.0/16
  networkID: vpc-4dd43e34
  networking:
    calico:
      majorVersion: v3
  nodePortAccess:
  - 10.40.0.0/16
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - cidr: 10.40.0.0/22
    id: subnet-0490c74c
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.40.4.0/22
    id: subnet-f910279f
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.40.8.0/22
    id: subnet-f608e4ac
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.40.128.0/22
    id: subnet-c29ec98a
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.40.132.0/22
    id: subnet-911027f7
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.40.136.0/22
    id: subnet-f708e4ad
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
  updatePolicy: external

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2b
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2b
  role: Master
  subnets:
  - us-west-2b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c
  role: Master
  subnets:
  - us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2a
  role: Node
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2b
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2b
  role: Node
  subnets:
  - us-west-2b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2c
  role: Node
  subnets:
  - us-west-2c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-west-cpu-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/cpu: ""
    k8s.io/cluster-autoscaler/node-template/taint/cpu: ""
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: c5.4xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-west-cpu-2a
  role: Node
  subnets:
  - us-west-2a
  taints:
  - role=cpu:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-west-mem-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/mem: ""
    k8s.io/cluster-autoscaler/node-template/taint/mem: ""
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: r4.4xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-west-mem-2a
  role: Node
  subnets:
  - us-west-2a
  taints:
  - role=mem:NoSchedule

9. Anything else we need to know?

  • I built from master at this SHA: 511a44c.
  • I initially ran kops cluster upgrade without changing the Calico majorVersion to v3.
  • The Calico node controller and the new Calico pod failed due to the removal of etcd_endpoints from the ConfigMap (see the quick check sketched below).
  • I then modified the cluster manifest to change the Calico majorVersion to v3, ran kops cluster upgrade again, and killed the master node that was failing.
  • Afterwards I still got the same issue with the node controller and etcd_endpoints.
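
For reference, a minimal check sketch, assuming the default kops object names for Calico (calico-config, calico-node, calico-kube-controllers); adjust the names if your manifests differ:

# Is the v2-style etcd_endpoints key still present in the ConfigMap after the upgrade?
kubectl -n kube-system get configmap calico-config -o yaml | grep -i etcd_endpoints

# Do the Calico workloads still reference that key (e.g. via configMapKeyRef)?
kubectl -n kube-system get ds/calico-node deploy/calico-kube-controllers -o yaml | grep -in etcd_endpoints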
@sstarcher
Contributor Author

@justinsb can we block the next release on this, unless I did something wrong?

@justinsb
Member

This would block the release, but I don't think we should block the alpha; not everyone will build from source.

The Calico upgrade with etcd3 is disruptive, though: https://github.com/kubernetes/kops/blob/63943277bc48f1faa7cc773c1c7b2d8127c4f9b3/docs/etcd3-migration.md

Did a kops rolling-update cluster work? It's (sadly) not expected that a standard rolling-update will work (the two variants are sketched below).
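
For anyone following along, a hedged sketch of the two kinds of roll being contrasted here; the exact procedure for the etcd3/Calico migration is in the linked doc, and these flags all exist in kops rolling-update:

# Standard roll: replaces instances one at a time and waits for the cluster to come back healthy
kops rolling-update cluster --yes

# Forced, cloud-only roll: does not talk to the Kubernetes API or validate the cluster,
# which is what a migration that takes the control plane down temporarily tends to require
kops rolling-update cluster --cloudonly --force --yes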

@sstarcher
Contributor Author

Agreed, it should not block the alpha. Let me circle back and test this again.

@justinsb
Member

So we chatted on Slack, and the problematic configuration seems to be k8s 1.12 + kops 1.11 + etcd3 + Legacy provider + Calico. After an upgrade to kops 1.12 it looks like there are two problems:

  1. The Calico update is applied "on top", but this means we keep some references that don't exist in the new version, for example CALICO_ETCD_ENDPOINTS pointing to a ConfigMap key (it looks like).
  2. The etcd-manager import of the existing cluster doesn't like the https scheme: error initializing etcd server: scheme not yet implemented: "https://etcd-events-1.internal.calico.awsdata.com:2381" (a note on where this surfaces follows below).
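
A hedged note on where problem 2 surfaces: with etcd-manager the error comes from the etcd containers on the masters, so something like the following should reproduce it (the container name and log paths are the usual kops defaults, not taken from this issue):

# On an affected master:
sudo docker ps | grep etcd-manager
sudo grep -i "scheme not yet implemented" /var/log/etcd.log /var/log/etcd-events.log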

@zachaller
Contributor

We are running into number 2 as well, but we are not using Calico; we are using Weave, and are just upgrading from etcd v3 with TLS (Legacy provider) to etcd v3 (Manager).
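
For context, the Legacy to Manager switch being described maps onto the etcdClusters provider field visible in the manifest earlier in this issue; a hedged, illustrative sketch of the edit (same field names as that manifest):

# kops edit cluster, then for both etcd clusters (main and events) flip the provider:
#   etcdClusters:
#   - name: main
#     provider: Manager   # was: Legacy
#   - name: events
#     provider: Manager   # was: Legacy
kops edit cluster --name my.example.com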

@mikesplain
Contributor

We are as well; I'm going to do some additional testing today and tomorrow.

@sstarcher
Contributor Author

Fixed in 1.12.0-alpha.3 by #6682.
It does require a rolling update of the masters all at once.

@justinsb
Member

Sorry for not updating this: #6682 should have fixed part 1 (the Calico manifest was broken), and #6695 should have fixed part 2 (moving TLS etcd to etcd-manager).
