kops-controller stale node label values #10185

Closed
trajakovic opened this issue Nov 6, 2020 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@trajakovic

1. What kops version are you running? The command kops version will display
this information.

Version 1.18.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

  1. Create an instance group (node) test1 and add
     nodeLabels:
       test: test
  2. Apply it to the kops cluster and update the cluster.
     • The node spawns with the label test=test.
  3. Edit the instance group and change
     nodeLabels:
       test: changed-test
  4. Apply it to the kops cluster, update the cluster, and do a rolling-update of the cluster (see the command sketch below).
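
For reference, the steps above correspond roughly to the following command sequence. This is only a sketch, not the exact commands run: the cluster name is a placeholder (set via KOPS_CLUSTER_NAME), and the label is edited under spec.nodeLabels of the instance group.

```sh
export KOPS_CLUSTER_NAME=production.example.k8s.local   # placeholder name

# 1. Create the instance group (opens an editor); add under spec:
#      nodeLabels:
#        test: test
kops create ig test1

# 2. Apply the change; the new node comes up with the label test=test.
kops update cluster --yes

# 3./4. Change the value to changed-test, apply, and roll the nodes.
kops edit ig test1
kops update cluster --yes
kops rolling-update cluster --yes
```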

5. What happened after the commands executed?

Newly spawned nodes still have the "old" value for the label, test=test.

  • Labels on the AutoScaling group are updated correctly.

6. What did you expect to happen?

Expected nodes with correct/new labels.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-06-13T08:01:41Z"
  generation: 5
  name: production.REDACTED.k8s.local
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "route53:ChangeResourceRecordSets"
          ],
          "Resource": [
            "arn:aws:route53:::hostedzone/*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "autoscaling:AttachLoadBalancers",
            "autoscaling:DetachLoadBalancers",
            "autoscaling:DetachLoadBalancerTargetGroups",
            "autoscaling:AttachLoadBalancerTargetGroups",
            "autoscaling:DescribeLoadBalancerTargetGroups",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:DescribeTags",
            "autoscaling:SetDesiredCapacity",
            "route53:ListHostedZones",
            "route53:ListResourceRecordSets",
            "route53:ListTagsForResource"
          ],
          "Resource": [
            "*"
          ]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "acm:ListCertificates",
            "acm:DescribeCertificate",
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:DescribeLoadBalancerTargetGroups",
            "autoscaling:DescribeTags",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "autoscaling:AttachLoadBalancers",
            "autoscaling:DetachLoadBalancers",
            "autoscaling:DetachLoadBalancerTargetGroups",
            "autoscaling:AttachLoadBalancerTargetGroups",
            "cloudformation:*",
            "elasticloadbalancing:*",
            "ec2:DescribeInstances",
            "ec2:DescribeSubnets",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeRouteTables",
            "ec2:DescribeVpcs",
            "iam:GetServerCertificate",
            "iam:ListServerCertificates",
            "route53:ListHostedZones",
            "route53:ListResourceRecordSets",
            "route53:ListTagsForResource"
          ],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "route53:ChangeResourceRecordSets"
          ],
          "Resource": [
            "arn:aws:route53:::hostedzone/*"
          ]
        }
      ]
  api:
    loadBalancer:
      idleTimeoutSeconds: 3600
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    kubernetes.io/cluster/production.REDACTED.k8s.local: owned
  cloudProvider: aws
  configBase: s3://production.REDACTED.k8s.local-cluster-state-store/production.REDACTED.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.18.10
  masterInternalName: api.internal.production.REDACTED.k8s.local
  masterPublicName: api.production.REDACTED.k8s.local
  networkCIDR: 10.20.0.0/16
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  rollingUpdate:
    maxSurge: 100%
  sshAccess:
  - 10.20.0.0/16
  subnets:
  - cidr: 10.20.32.0/19
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.20.64.0/19
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.20.96.0/19
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.20.0.0/22
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.20.4.0/22
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.20.8.0/22
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    bastion:
      bastionPublicName: bastion.REDACTED.k8s.local
      idleTimeoutSeconds: 1200
    dns:
      type: Public
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

N/A

9. Anything else we need to know?

Looking at the (leader) kops-controller logs, it seems to be unaware of node label changes.
After deleting the leader pod, the new leader started patching nodes with the new labels.

I'm not aware of any configuration for refreshing AWS metadata / kops resources in kops-controller, so my wild guess is that kops-controller is unaware of label changes (it probably reads the state file from S3 only at the beginning of its leader mandate).
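
For anyone hitting this on an affected kops version, here is a minimal sketch of the workaround described above. It assumes kops-controller runs as the usual DaemonSet in kube-system with the label k8s-app=kops-controller; verify the namespace and selector in your cluster before deleting anything.

```sh
# List the kops-controller pods (one of them holds the leader lease).
kubectl -n kube-system get pods -l k8s-app=kops-controller -o wide

# Deleting them forces a new leader election; the new leader re-reads the
# instance group spec from the state store and labels new nodes correctly.
kubectl -n kube-system delete pod -l k8s-app=kops-controller
```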

@johngmyers
Member

I suspect this was fixed by #9575

@dnalencastre

I'm getting the same behaviour in most of my attempts, with the caveat that the labels do eventually change.

Once I discovered (by accident) that the labels eventually changed, I measured it to take 52 minutes after the first replacement node became available according to kubectl get nodes.
This 52-minute delay was consistent across 3 different attempts.

Further, to add to the confusion, I did get an attempt in which the labels changed within around 15 minutes (can't be more precise, as I wasn't measuring times in that attempt).
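
In case it helps others reproduce the timing, a rough way to watch when the value actually flips (a sketch; ig_base_label_key_01 is the example key from the reproduction below):

```sh
# Print a timestamp and the label value on every node once a minute until the
# new value appears; -L adds the label as an extra column.
while true; do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  kubectl get nodes -L ig_base_label_key_01 --no-headers
  sleep 60
done
```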

1. What kops version are you running? The command kops version will display
this information.

Version 1.18.2 (git-84495481e4)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:17:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.12", GitCommit:"7cd5e9086de8ae25d6a1514d0c87bac67ca4a481", GitTreeState:"clean", BuildDate:"2020-11-12T09:11:15Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

On an instance group with the label
ig_base_label_key_01: ig_base_label_val_01

edit the instance group to change the value to ig_base_label_val_02.

Run kops rolling-update cluster --yes
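
Spelled out with the instance group and cluster names from the manifests below (a sketch; it assumes kops update cluster --yes was run between the edit and the rolling update, which isn't stated explicitly above):

```sh
# Change ig_base_label_key_01 to ig_base_label_val_02 under spec.nodeLabels.
kops edit ig base --name my-test-cluster.k8s.local
kops update cluster --name my-test-cluster.k8s.local --yes   # assumed step
kops rolling-update cluster --name my-test-cluster.k8s.local --yes
```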

5. What happened after the commands executed?

Newly spawned nodes still have the previous value for the label

ig_base_label_key_01=ig_base_label_val_01

  • Labels on the AutoScaling group are updated correctly.

6. What did you expect to happen?

Expected nodes with correct/new labels, i.e. ig_base_label_key_01=ig_base_label_val_02

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-12-21T10:34:15Z"
  name: my-test-cluster.k8s.local
spec:
  api:
    loadBalancer:
      securityGroupOverride: sg-AAAAAAAAAAAAAA
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://nope-my-test-cluster/my-test-cluster.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1c-1
      name: "1"
    - instanceGroup: master-eu-central-1b-2
      name: "2"
    - instanceGroup: master-eu-central-1a-3
      name: "3"
    name: main
    version: 3.2.24
  - etcdMembers:
    - instanceGroup: master-eu-central-1c-1
      name: "1"
    - instanceGroup: master-eu-central-1b-2
      name: "2"
    - instanceGroup: master-eu-central-1a-3
      name: "3"
    name: events
    version: 3.2.24
  fileAssets:
  - content: |
      PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
      DOCKER_OPTS="--ip-masq=false --iptables=false --log-driver=json-file --log-level=warn --log-opt=max-file=5 --log-opt=max-size=10m --storage-driver=overlay2 --max-concurrent-downloads=10"
    name: etc-env-dockerd-config
    path: /etc/environment
  hooks:
  - manifest: |
      [Unit]
      Description=Save and load common docker images
      Before=kubelet.service
      [Service]
      EnvironmentFile=/etc/environment
      ExecStartPre=/usr/bin/docker image save k8s.gcr.io/pause-amd64:3.0 -o /opt/preloaded_docker_images.tar
      ExecStart=/usr/bin/docker image load -i /opt/preloaded_docker_images.tar
      ExecStop=
    name: docker-image-preload
    useRawManifest: true
  - before:
    - kubelet.service
    manifest: |
      [Service]
      Type=oneshot
      RemainAfterExit=no
      ExecStart=/bin/sh -c "sed -i -- 's/pool/#pool/g' /etc/ntp.conf ; echo 'server 169.254.169.123 prefer iburst' >> /etc/ntp.conf"
      ExecStartPost=/bin/systemctl restart ntp.service
    name: change_ntp_server.service
    roles:
    - Node
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    imagePullProgressDeadline: 30m0s
    serializeImagePulls: true
  kubernetesApiAccess:
  - x.x.x.0/9
  - y.y.0.0/12
  kubernetesVersion: 1.18.12
  masterInternalName: api.internal.my-test-cluster.k8s.local
  masterPublicName: api.my-test-cluster.k8s.local
  networkCIDR: x.x.x.0/20
  networkID: vpc-01e6cdc5e0fd3e7e7
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: x.x.4.0/23
    id: subnet-AAAAAAAAAAAAA
    name: subnet-AAAAAAAAAAAAA
    type: Private
    zone: eu-central-1c
  - cidr: x.x.4.0/23
    id: subnet-AAAAAAAAAAAAA
    name: utility-subnet-AAAAAAAAAAAAA
    type: Utility
    zone: eu-central-1c
  - cidr: x.x.2.0/23
    id: subnet-BBBBBBBBBB
    name: subnet-BBBBBBBBBB
    type: Private
    zone: eu-central-1b
  - cidr: x.x.2.0/23
    id: subnet-BBBBBBBBBB
    name: utility-subnet-BBBBBBBBBB
    type: Utility
    zone: eu-central-1b
  - cidr: x.x.x.0/23
    id: subnet-CCCCCCCCCCCC
    name: subnet-CCCCCCCCCCCC
    type: Private
    zone: eu-central-1a
  - cidr: x.x.x.0/23
    id: subnet-CCCCCCCCCCCC
    name: utility-subnet-CCCCCCCCCCCC
    type: Utility
    zone: eu-central-1a
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:53:56Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: base
spec:
  additionalSecurityGroups:
  - sg-BBBBBBBBB
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-base
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    ig_base_label_key_01: ig_base_label_val_02
    kops.k8s.io/instancegroup: base
  role: Node
  rootVolumeSize: 30
  securityGroupOverride: sg-CCCCCCCCCCCC
  subnets:
  - subnet-AAAAAAAAAAAAA
  - subnet-BBBBBBBBBB
  - subnet-CCCCCCCCCCCC

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: ig_custom01
spec:
  additionalSecurityGroups:
  - sg-BBBBBBBBB
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-ig_custom01
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 0
  minSize: 0
  nodeLabels:
    dedicated: ig_custom01
    ig_custom01_label_key_01: ig_custom01_label_val_01
    kops.k8s.io/instancegroup: ig_custom01
  role: Node
  rootVolumeSize: 30
  securityGroupOverride: sg-CCCCCCCCCCCC
  subnets:
  - subnet-AAAAAAAAAAAAA
  taints:
  - dedicated=ig_custom01:NoSchedule

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: master-eu-central-1a-3
spec:
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-master
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1a-3
  role: Master
  rootVolumeSize: 8
  securityGroupOverride: sg-DDDDDDDDDDDD
  subnets:
  - subnet-CCCCCCCCCCCC

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: master-eu-central-1b-2
spec:
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-master
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1b-2
  role: Master
  rootVolumeSize: 8
  securityGroupOverride: sg-DDDDDDDDDDDD
  subnets:
  - subnet-BBBBBBBBBB

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: master-eu-central-1c-1
spec:
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-master
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1c-1
  role: Master
  rootVolumeSize: 8
  securityGroupOverride: sg-DDDDDDDDDDDD
  subnets:
  - subnet-AAAAAAAAAAAAA

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 20, 2021
@johngmyers
Member

Fixed in 1.20 by #9575
