kops-controller stale node label values #10185

Closed
trajakovic opened this issue Nov 6, 2020 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@trajakovic

1. What kops version are you running? The command kops version will display
this information.

Version 1.18.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

  1. Create an instance group (node) test1 and add
     nodeLabels:
       test: test
  2. Apply it to the kops cluster and update the cluster.
     • The node spawns with the label test=test.
  3. Edit the instance group and change
     nodeLabels:
       test: changed-test
  4. Apply it to the kops cluster, update the cluster, and do a rolling-update of the cluster (see the command sketch below).
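
For reference, the steps above correspond roughly to the following command sequence. This is only a sketch, not the exact commands run: the cluster name is a placeholder (set via KOPS_CLUSTER_NAME), and the label is edited under spec.nodeLabels of the instance group.

```sh
export KOPS_CLUSTER_NAME=production.example.k8s.local   # placeholder name

# 1. Create the instance group (opens an editor); add under spec:
#      nodeLabels:
#        test: test
kops create ig test1

# 2. Apply the change; the new node comes up with the label test=test.
kops update cluster --yes

# 3./4. Change the value to changed-test, apply, and roll the nodes.
kops edit ig test1
kops update cluster --yes
kops rolling-update cluster --yes
```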

5. What happened after the commands executed?

Newly spawned nodes still have the "old" value for the label, test=test.

  • Labels on the AutoScaling group are updated correctly.

6. What did you expect to happen?

Expected nodes with correct/new labels.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-06-13T08:01:41Z"
  generation: 5
  name: production.REDACTED.k8s.local
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "route53:ChangeResourceRecordSets"
          ],
          "Resource": [
            "arn:aws:route53:::hostedzone/*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "autoscaling:AttachLoadBalancers",
            "autoscaling:DetachLoadBalancers",
            "autoscaling:DetachLoadBalancerTargetGroups",
            "autoscaling:AttachLoadBalancerTargetGroups",
            "autoscaling:DescribeLoadBalancerTargetGroups",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:DescribeTags",
            "autoscaling:SetDesiredCapacity",
            "route53:ListHostedZones",
            "route53:ListResourceRecordSets",
            "route53:ListTagsForResource"
          ],
          "Resource": [
            "*"
          ]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "acm:ListCertificates",
            "acm:DescribeCertificate",
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:DescribeAutoScalingInstances",
            "autoscaling:DescribeLaunchConfigurations",
            "autoscaling:DescribeLoadBalancerTargetGroups",
            "autoscaling:DescribeTags",
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "autoscaling:AttachLoadBalancers",
            "autoscaling:DetachLoadBalancers",
            "autoscaling:DetachLoadBalancerTargetGroups",
            "autoscaling:AttachLoadBalancerTargetGroups",
            "cloudformation:*",
            "elasticloadbalancing:*",
            "ec2:DescribeInstances",
            "ec2:DescribeSubnets",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeRouteTables",
            "ec2:DescribeVpcs",
            "iam:GetServerCertificate",
            "iam:ListServerCertificates",
            "route53:ListHostedZones",
            "route53:ListResourceRecordSets",
            "route53:ListTagsForResource"
          ],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "route53:ChangeResourceRecordSets"
          ],
          "Resource": [
            "arn:aws:route53:::hostedzone/*"
          ]
        }
      ]
  api:
    loadBalancer:
      idleTimeoutSeconds: 3600
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    kubernetes.io/cluster/production.REDACTED.k8s.local: owned
  cloudProvider: aws
  configBase: s3://production.REDACTED.k8s.local-cluster-state-store/production.REDACTED.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.18.10
  masterInternalName: api.internal.production.REDACTED.k8s.local
  masterPublicName: api.production.REDACTED.k8s.local
  networkCIDR: 10.20.0.0/16
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  rollingUpdate:
    maxSurge: 100%
  sshAccess:
  - 10.20.0.0/16
  subnets:
  - cidr: 10.20.32.0/19
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.20.64.0/19
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.20.96.0/19
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.20.0.0/22
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.20.4.0/22
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.20.8.0/22
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    bastion:
      bastionPublicName: bastion.REDACTED.k8s.local
      idleTimeoutSeconds: 1200
    dns:
      type: Public
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

N/A

9. Anything else we need to know?

Looking at the (leader) kops-controller logs, it seems to be unaware of node label changes.
After deleting the leader pod, the new leader started patching nodes with the new labels.

I'm not aware of any configuration for refreshing AWS metadata / kops resources in kops-controller, so my wild guess is that kops-controller is unaware of label changes (it probably reads the state file from S3 only at the beginning of its leader mandate).
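
For anyone hitting this on an affected kops version, here is a minimal sketch of the workaround described above. It assumes kops-controller runs as the usual DaemonSet in kube-system with the label k8s-app=kops-controller; verify the namespace and selector in your cluster before deleting anything.

```sh
# List the kops-controller pods (one of them holds the leader lease).
kubectl -n kube-system get pods -l k8s-app=kops-controller -o wide

# Deleting them forces a new leader election; the new leader re-reads the
# instance group spec from the state store and labels new nodes correctly.
kubectl -n kube-system delete pod -l k8s-app=kops-controller
```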

@johngmyers
Member

I suspect this was fixed by #9575

@dnalencastre

I'm getting the same behaviour in most of my attempts, with the caveat that the labels do eventually change.

Once I discovered (by accident) that the labels eventually changed, I measured it to take 52 minutes after the first replacement node became available according to kubectl get nodes.
This 52-minute delay was consistent across 3 different attempts.

Further, to add to the confusion, I did get an attempt in which the labels changed within around 15 minutes (can't be more precise, as I wasn't measuring times in that attempt).
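
In case it helps others reproduce the timing, a rough way to watch when the value actually flips (a sketch; ig_base_label_key_01 is the example key from the reproduction below):

```sh
# Print a timestamp and the label value on every node once a minute until the
# new value appears; -L adds the label as an extra column.
while true; do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  kubectl get nodes -L ig_base_label_key_01 --no-headers
  sleep 60
done
```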

1. What kops version are you running? The command kops version will display
this information.

Version 1.18.2 (git-84495481e4)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:17:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.12", GitCommit:"7cd5e9086de8ae25d6a1514d0c87bac67ca4a481", GitTreeState:"clean", BuildDate:"2020-11-12T09:11:15Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

On an instance group with the label
ig_base_label_key_01: ig_base_label_val_01

edit the instance group to change the value to ig_base_label_val_02.

Run kops rolling-update cluster --yes
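
Spelled out with the instance group and cluster names from the manifests below (a sketch; it assumes kops update cluster --yes was run between the edit and the rolling update, which isn't stated explicitly above):

```sh
# Change ig_base_label_key_01 to ig_base_label_val_02 under spec.nodeLabels.
kops edit ig base --name my-test-cluster.k8s.local
kops update cluster --name my-test-cluster.k8s.local --yes   # assumed step
kops rolling-update cluster --name my-test-cluster.k8s.local --yes
```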

5. What happened after the commands executed?

Newly spawned nodes still have the previous value for the label

ig_base_label_key_01=ig_base_label_val_01

  • Labels on the AutoScaling group are updated correctly.

6. What did you expect to happen?

Expected nodes with correct/new labels, i.e. ig_base_label_key_01=ig_base_label_val_02

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-12-21T10:34:15Z"
  name: my-test-cluster.k8s.local
spec:
  api:
    loadBalancer:
      securityGroupOverride: sg-AAAAAAAAAAAAAA
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://nope-my-test-cluster/my-test-cluster.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1c-1
      name: "1"
    - instanceGroup: master-eu-central-1b-2
      name: "2"
    - instanceGroup: master-eu-central-1a-3
      name: "3"
    name: main
    version: 3.2.24
  - etcdMembers:
    - instanceGroup: master-eu-central-1c-1
      name: "1"
    - instanceGroup: master-eu-central-1b-2
      name: "2"
    - instanceGroup: master-eu-central-1a-3
      name: "3"
    name: events
    version: 3.2.24
  fileAssets:
  - content: |
      PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
      DOCKER_OPTS="--ip-masq=false --iptables=false --log-driver=json-file --log-level=warn --log-opt=max-file=5 --log-opt=max-size=10m --storage-driver=overlay2 --max-concurrent-downloads=10"
    name: etc-env-dockerd-config
    path: /etc/environment
  hooks:
  - manifest: |
      [Unit]
      Description=Save and load common docker images
      Before=kubelet.service
      [Service]
      EnvironmentFile=/etc/environment
      ExecStartPre=/usr/bin/docker image save k8s.gcr.io/pause-amd64:3.0 -o /opt/preloaded_docker_images.tar
      ExecStart=/usr/bin/docker image load -i /opt/preloaded_docker_images.tar
      ExecStop=
    name: docker-image-preload
    useRawManifest: true
  - before:
    - kubelet.service
    manifest: |
      [Service]
      Type=oneshot
      RemainAfterExit=no
      ExecStart=/bin/sh -c "sed -i -- 's/pool/#pool/g' /etc/ntp.conf ; echo 'server 169.254.169.123 prefer iburst' >> /etc/ntp.conf"
      ExecStartPost=/bin/systemctl restart ntp.service
    name: change_ntp_server.service
    roles:
    - Node
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    imagePullProgressDeadline: 30m0s
    serializeImagePulls: true
  kubernetesApiAccess:
  - x.x.x.0/9
  - y.y.0.0/12
  kubernetesVersion: 1.18.12
  masterInternalName: api.internal.my-test-cluster.k8s.local
  masterPublicName: api.my-test-cluster.k8s.local
  networkCIDR: x.x.x.0/20
  networkID: vpc-01e6cdc5e0fd3e7e7
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: x.x.4.0/23
    id: subnet-AAAAAAAAAAAAA
    name: subnet-AAAAAAAAAAAAA
    type: Private
    zone: eu-central-1c
  - cidr: x.x.4.0/23
    id: subnet-AAAAAAAAAAAAA
    name: utility-subnet-AAAAAAAAAAAAA
    type: Utility
    zone: eu-central-1c
  - cidr: x.x.2.0/23
    id: subnet-BBBBBBBBBB
    name: subnet-BBBBBBBBBB
    type: Private
    zone: eu-central-1b
  - cidr: x.x.2.0/23
    id: subnet-BBBBBBBBBB
    name: utility-subnet-BBBBBBBBBB
    type: Utility
    zone: eu-central-1b
  - cidr: x.x.x.0/23
    id: subnet-CCCCCCCCCCCC
    name: subnet-CCCCCCCCCCCC
    type: Private
    zone: eu-central-1a
  - cidr: x.x.x.0/23
    id: subnet-CCCCCCCCCCCC
    name: utility-subnet-CCCCCCCCCCCC
    type: Utility
    zone: eu-central-1a
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:53:56Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: base
spec:
  additionalSecurityGroups:
  - sg-BBBBBBBBB
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-base
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    ig_base_label_key_01: ig_base_label_val_02
    kops.k8s.io/instancegroup: base
  role: Node
  rootVolumeSize: 30
  securityGroupOverride: sg-CCCCCCCCCCCC
  subnets:
  - subnet-AAAAAAAAAAAAA
  - subnet-BBBBBBBBBB
  - subnet-CCCCCCCCCCCC

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: ig_custom01
spec:
  additionalSecurityGroups:
  - sg-BBBBBBBBB
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-ig_custom01
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 0
  minSize: 0
  nodeLabels:
    dedicated: ig_custom01
    ig_custom01_label_key_01: ig_custom01_label_val_01
    kops.k8s.io/instancegroup: ig_custom01
  role: Node
  rootVolumeSize: 30
  securityGroupOverride: sg-CCCCCCCCCCCC
  subnets:
  - subnet-AAAAAAAAAAAAA
  taints:
  - dedicated=ig_custom01:NoSchedule

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: master-eu-central-1a-3
spec:
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-master
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1a-3
  role: Master
  rootVolumeSize: 8
  securityGroupOverride: sg-DDDDDDDDDDDD
  subnets:
  - subnet-CCCCCCCCCCCC

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: master-eu-central-1b-2
spec:
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-master
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1b-2
  role: Master
  rootVolumeSize: 8
  securityGroupOverride: sg-DDDDDDDDDDDD
  subnets:
  - subnet-BBBBBBBBBB

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-12-21T10:34:22Z"
  labels:
    kops.k8s.io/cluster: my-test-cluster.k8s.local
  name: master-eu-central-1c-1
spec:
  associatePublicIp: false
  cloudLabels:
    Datacenter: my-dc
    Env: inf
    Hostname: inf-my-test-cluster-master
    Team: platform
  image: ami-021529cc234437cea
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1c-1
  role: Master
  rootVolumeSize: 8
  securityGroupOverride: sg-DDDDDDDDDDDD
  subnets:
  - subnet-AAAAAAAAAAAAA

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 20, 2021
@johngmyers
Member

Fixed in 1.20 by #9575
