
AWS Security Group rules are removed when adding/removing worker nodes #64148

Closed
zegl opened this issue May 22, 2018 · 29 comments
Labels
area/provider/aws: Issues or PRs related to aws provider.
kind/bug: Categorizes issue or PR as related to a bug.
sig/cloud-provider: Categorizes an issue or PR as relevant to SIG Cloud Provider.

Comments

@zegl
Contributor

zegl commented May 22, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

When scaling the number of worker nodes up or down, all rules in the security group managed by the controller-manager are removed.

Here are the logs from the controller-manager when the problem started; they don't contain any errors or warnings.

I0522 12:43:25.588545       1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[ip-10-1-128-213.eu-west-1.compute.internal:{} ip-10-1-170-254.eu-west-1.compute.internal:{} ip-10-1-149-189.eu-west-1.compute.internal:{} ip-10-1-149-238.eu-west-1.compute.internal:{}]
I0522 12:43:27.168490       1 service_controller.go:644] Successfully updated 15 out of 15 load balancers to direct traffic to the updated set of nodes
I0522 12:43:27.168682       1 event.go:218] Event(v1.ObjectReference{Kind:"Service", Namespace:"ingress-nginx", Name:"ingress-nginx", UID:"b6d3caf6-5290-11e8-a49a-0a96ccd212fe", APIVersion:"v1", ResourceVersion:"133452", FieldPath:""}): type: 'Normal' reason: 'UpdatedLoadBalancer' Updated load balancer with new hosts

I don't know where the number "15" comes from; there's only one LoadBalancer (3 ports, 4 workers, 3 AZs).

Here are the logs from the controller-manager when another node has taken over the leadership:

I0522 12:48:51.268858       1 event.go:218] Event(v1.ObjectReference{Kind:"Service", Namespace:"ingress-nginx", Name:"ingress-nginx", UID:"b6d3caf6-5290-11e8-a49a-0a96ccd212fe", APIVersion:"v1", ResourceVersion:"133452", FieldPath:""}): type: 'Normal' reason: 'EnsuringLoadBalancer' Ensuring load balancer
I0522 12:48:51.282935       1 controller_utils.go:1026] Caches are synced for disruption controller
I0522 12:48:51.282955       1 disruption.go:296] Sending events to api server.
I0522 12:48:52.414735       1 event.go:218] Event(v1.ObjectReference{Kind:"Service", Namespace:"ingress-nginx", Name:"ingress-nginx", UID:"b6d3caf6-5290-11e8-a49a-0a96ccd212fe", APIVersion:"v1", ResourceVersion:"133452", FieldPath:""}): type: 'Normal' reason: 'EnsuredLoadBalancer' Ensured load balancer

What you expected to happen:

Security group rules should not be removed when the LoadBalancer has not changed.

How to reproduce it (as minimally and precisely as possible):

  1. Run Kubernetes worker nodes on AWS.
  2. Add a LoadBalancer Service that uses an AWS NLB.
  3. Add or remove a worker node.

Anything else we need to know?:

I've only tested this with NLBs; ELBs might not be affected.

The Service has the following annotations:

service.beta.kubernetes.io/aws-load-balancer-internal: "true"
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
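
For reference, a minimal client-go sketch of a Service carrying these annotations. The name, namespace, selector, and ports here are hypothetical, and it assumes a recent client-go where Create takes a context; on AWS, a Service of this type is what drives the controller-manager's NLB provisioning and security group management.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; in-cluster config would also work.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "ingress-nginx",
			Namespace: "ingress-nginx",
			Annotations: map[string]string{
				"service.beta.kubernetes.io/aws-load-balancer-internal":                          "true",
				"service.beta.kubernetes.io/aws-load-balancer-type":                              "nlb",
				"service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
			},
		},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeLoadBalancer,
			Selector: map[string]string{"app": "ingress-nginx"}, // hypothetical pod selector
			Ports: []corev1.ServicePort{
				{Name: "http", Port: 80, TargetPort: intstr.FromInt(80)},
				{Name: "https", Port: 443, TargetPort: intstr.FromInt(443)},
			},
		},
	}

	// Creating this Service is what makes the cloud provider provision the
	// NLB and start managing security group rules for its node ports.
	if _, err := client.CoreV1().Services(svc.Namespace).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}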

Restarting the leader controller-manager solves the problem. The new leader will re-add the missing security group rules.

We're running multiple Kubernetes Clusters on the same AWS account.

Environment:

  • Kubernetes version (use kubectl version): v1.10.2
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Install tools: None
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels May 22, 2018
@zegl
Contributor Author

zegl commented May 22, 2018

/sig AWS

@k8s-ci-robot k8s-ci-robot added sig/aws and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 22, 2018
@zegl
Contributor Author

zegl commented May 22, 2018

I enabled verbose logging and reproduced the problem:

I0522 13:55:42.877392       1 aws_loadbalancer.go:657] Removing rule for client MTU discovery from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877432       1 aws_loadbalancer.go:658] Removing rule for client traffic from the network load balancer ([0.0.0.0/0]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877450       1 aws_loadbalancer.go:660] Removing rule for health check traffic from the network load balancer ([0.0.0.0/0]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877472       1 aws_loadbalancer.go:657] Removing rule for client MTU discovery from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877488       1 aws_loadbalancer.go:658] Removing rule for client traffic from the network load balancer ([0.0.0.0/0]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877504       1 aws_loadbalancer.go:660] Removing rule for health check traffic from the network load balancer ([0.0.0.0/0]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877524       1 aws_loadbalancer.go:657] Removing rule for client MTU discovery from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877540       1 aws_loadbalancer.go:658] Removing rule for client traffic from the network load balancer ([0.0.0.0/0]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.877555       1 aws_loadbalancer.go:660] Removing rule for health check traffic from the network load balancer ([0.0.0.0/0]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:42.942559       1 aws.go:2879] Removing security group ingress: sg-04b4cbf9a3fa15db1 [{
  FromPort: 32551,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32551
} {
  FromPort: 31093,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 31093
} {
  FromPort: 32030,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32030
}]
I0522 13:55:43.064509       1 aws_loadbalancer.go:660] Removing rule for health check traffic from the network load balancer ([10.1.0.0/16]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:43.064542       1 aws_loadbalancer.go:660] Removing rule for health check traffic from the network load balancer ([10.1.0.0/16]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:43.064585       1 aws_loadbalancer.go:660] Removing rule for health check traffic from the network load balancer ([10.1.0.0/16]) to instance (sg-04b4cbf9a3fa15db1)
I0522 13:55:43.084370       1 aws.go:2879] Removing security group ingress: sg-04b4cbf9a3fa15db1 [{
  FromPort: 31093,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "10.1.0.0/16",
      Description: "kubernetes.io/rule/nlb/health=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 31093
} {
  FromPort: 32030,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "10.1.0.0/16",
      Description: "kubernetes.io/rule/nlb/health=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32030
} {
  FromPort: 32551,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "10.1.0.0/16",
      Description: "kubernetes.io/rule/nlb/health=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32551
}]
I0522 13:55:43.164137       1 service_controller.go:326] Not persisting unchanged LoadBalancerStatus for service ingress-nginx/ingress-nginx to registry.
I0522 13:55:43.164469       1 event.go:218] Event(v1.ObjectReference{Kind:"Service", Namespace:"ingress-nginx", Name:"ingress-nginx", UID:"b6d3caf6-5290-11e8-a49a-0a96ccd212fe", APIVersion:"v1", ResourceVersion:"133452", FieldPath:""}): type: 'Normal' reason: 'EnsuredLoadBalancer' Ensured load balancer

A re-election causes the next kube-controller-manager to re-add the rules:

I0522 13:59:15.782597       1 aws_loadbalancer.go:650] Adding rule for client MTU discovery from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.782625       1 aws_loadbalancer.go:651] Adding rule for client traffic from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.782640       1 aws_loadbalancer.go:650] Adding rule for client MTU discovery from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.782649       1 aws_loadbalancer.go:651] Adding rule for client traffic from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.782659       1 aws_loadbalancer.go:650] Adding rule for client MTU discovery from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.782670       1 aws_loadbalancer.go:651] Adding rule for client traffic from the network load balancer ([0.0.0.0/0]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.801114       1 aws.go:2791] Existing security group ingress: sg-04b4cbf9a3fa15db1 [{
  FromPort: 3,
  IpProtocol: "icmp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/mtu=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 4
}]
I0522 13:59:15.801170       1 aws.go:2819] Adding security group ingress: sg-04b4cbf9a3fa15db1 [{
  FromPort: 32551,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32551
} {
  FromPort: 32030,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32030
} {
  FromPort: 31093,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 31093
}]
I0522 13:59:15.941217       1 aws_loadbalancer.go:653] Adding rule for health check traffic from the network load balancer ([10.1.0.0/16]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.941249       1 aws_loadbalancer.go:653] Adding rule for health check traffic from the network load balancer ([10.1.0.0/16]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.941264       1 aws_loadbalancer.go:653] Adding rule for health check traffic from the network load balancer ([10.1.0.0/16]) to instances (sg-04b4cbf9a3fa15db1)
I0522 13:59:15.960322       1 aws.go:2791] Existing security group ingress: sg-04b4cbf9a3fa15db1 [{
  FromPort: 32551,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32551
} {
  FromPort: 31093,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 31093
} {
  FromPort: 32030,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/client=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32030
} {
  FromPort: 3,
  IpProtocol: "icmp",
  IpRanges: [{
      CidrIp: "0.0.0.0/0",
      Description: "kubernetes.io/rule/nlb/mtu=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 4
}]
I0522 13:59:15.960394       1 aws.go:2819] Adding security group ingress: sg-04b4cbf9a3fa15db1 [{
  FromPort: 32551,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "10.1.0.0/16",
      Description: "kubernetes.io/rule/nlb/health=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32551
} {
  FromPort: 32030,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "10.1.0.0/16",
      Description: "kubernetes.io/rule/nlb/health=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 32030
} {
  FromPort: 31093,
  IpProtocol: "tcp",
  IpRanges: [{
      CidrIp: "10.1.0.0/16",
      Description: "kubernetes.io/rule/nlb/health=ab6d3caf6529011e8a49a0a96ccd212f"
    }],
  ToPort: 31093
}]
I0522 13:59:16.045626       1 service_controller.go:326] Not persisting unchanged LoadBalancerStatus for service ingress-nginx/ingress-nginx to registry.
I0522 13:59:16.045945       1 event.go:218] Event(v1.ObjectReference{Kind:"Service", Namespace:"ingress-nginx", Name:"ingress-nginx", UID:"b6d3caf6-5290-11e8-a49a-0a96ccd212fe", APIVersion:"v1", ResourceVersion:"133452", FieldPath:""}): type: 'Normal' reason: 'EnsuredLoadBalancer' Ensured load balancer
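
The logs suggest a set-difference reconcile: the controller computes the ingress rules it wants ("desired"), compares them with what is on the security group ("actual"), revokes actual-minus-desired, and authorizes desired-minus-actual. Below is a minimal, self-contained sketch of that pattern (illustrative names, not the actual aws.go code): if a bug makes the desired set come out empty, every owned rule lands in the revoke list, which is exactly what the removal logs above show.

package main

import "fmt"

// rule is a simplified stand-in for an EC2 IpPermission entry.
type rule struct {
	proto    string
	fromPort int64
	toPort   int64
	cidr     string
}

// reconcile returns the rules to revoke (actual minus desired) and the
// rules to authorize (desired minus actual).
func reconcile(desired, actual map[rule]bool) (toRevoke, toAuthorize []rule) {
	for r := range actual {
		if !desired[r] {
			toRevoke = append(toRevoke, r)
		}
	}
	for r := range desired {
		if !actual[r] {
			toAuthorize = append(toAuthorize, r)
		}
	}
	return toRevoke, toAuthorize
}

func main() {
	// Rules currently on the managed security group (node ports from the logs).
	actual := map[rule]bool{
		{"tcp", 31093, 31093, "0.0.0.0/0"}: true,
		{"tcp", 32030, 32030, "0.0.0.0/0"}: true,
		{"tcp", 32551, 32551, "0.0.0.0/0"}: true,
	}
	// Failure mode: the desired set is (wrongly) computed as empty, so every
	// owned rule lands in toRevoke, matching the removal logs above.
	toRevoke, toAuthorize := reconcile(map[rule]bool{}, actual)
	fmt.Printf("revoke %d rules, authorize %d rules\n", len(toRevoke), len(toAuthorize))
}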

@zegl
Contributor Author

zegl commented May 22, 2018

Honest question: We have one "k8s-managed-lb" shared between all worker nodes. Should we have one managed security group rule per instance instead? I don't know how that would work together with auto scaling groups, but it's worth asking.

@zegl
Contributor Author

zegl commented May 22, 2018

/cc @micahhausler

@FrederikNJS

I'm also testing out Kubernetes 1.10.1 with NLBs for ingress, and I'm seeing exactly the same problem.

@zegl
Contributor Author

zegl commented May 22, 2018

@FrederikNJS Thanks for letting me know.

I guess we'll have to use ELBs instead until this issue has been resolved.

@FrederikNJS

It seems that this might have something to do with using private subnets.

I have 2 clusters running the same version, with the same NLB setup. The only difference between the two clusters is that one has all worker nodes in public subnets and the other has all nodes in private subnets.

Only the private subnet cluster is experiencing this problem.

I can see from the logs that, right before the ports are removed from the security group, it outputs "Ignoring private subnet for public ELB" for all the private worker subnets.
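
A hedged illustration of how that filtering could interact badly with the reconcile sketched earlier (this is a guess at the failure mode with invented names, not the actual cloud-provider code): if every private subnet is dropped before the desired CIDRs are computed, the desired rule set comes out empty and the diff revokes everything.

package main

import "fmt"

// subnet is a simplified view of what the cloud provider sees in the VPC.
type subnet struct {
	id     string
	cidr   string
	public bool // has a route to an internet gateway
}

// desiredCIDRs mimics building the CIDR list used for the load balancer's
// security group rules after a subnet-visibility filter. If the LB is
// treated as public and every subnet is private, nothing survives.
func desiredCIDRs(subnets []subnet, internalLB bool) []string {
	var cidrs []string
	for _, s := range subnets {
		if !internalLB && !s.public {
			fmt.Printf("Ignoring private subnet %s for public ELB\n", s.id)
			continue
		}
		cidrs = append(cidrs, s.cidr)
	}
	return cidrs
}

func main() {
	private := []subnet{
		{"subnet-a", "10.1.128.0/20", false},
		{"subnet-b", "10.1.144.0/20", false},
	}
	// An all-private cluster mis-handled as "public": zero CIDRs survive,
	// so the reconcile step would revoke every existing rule.
	fmt.Println(desiredCIDRs(private, false)) // []
}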

@micahhausler
Member

Yeah, this looks like a bug. Thanks for opening this.

@vikasuy

vikasuy commented May 24, 2018

related: #60825

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2018
@vikasuy

vikasuy commented Aug 22, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2018
@kellycampbell

I think this is fixed by #68422. I tested by replacing a node and watching the security group rules in AWS before/after.
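
One way to run that check yourself, assuming aws-sdk-go v1 and AWS credentials in the environment: snapshot the managed group's ingress rules before and after replacing a node and diff the two outputs. The group ID below is the one from the logs in this issue; substitute your own.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-1")))
	svc := ec2.New(sess)

	// Group ID taken from the logs above; substitute your own.
	out, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String("sg-04b4cbf9a3fa15db1")},
	})
	if err != nil {
		panic(err)
	}

	// One line per (protocol, port range, CIDR) so two snapshots taken
	// before and after a node replacement can be diffed.
	for _, g := range out.SecurityGroups {
		for _, p := range g.IpPermissions {
			for _, r := range p.IpRanges {
				fmt.Printf("%s %d-%d %s %s\n",
					aws.StringValue(p.IpProtocol),
					aws.Int64Value(p.FromPort), aws.Int64Value(p.ToPort),
					aws.StringValue(r.CidrIp), aws.StringValue(r.Description))
			}
		}
	}
}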

kellycampbell pushed a commit to kellycampbell/kubernetes that referenced this issue Nov 11, 2018
This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2018
@FrederikNJS

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2018
@FrederikNJS

I have not been able to test this as kops doesn't support 1.13 yet. And I have not heard anyone report it fixed, so I'd like to keep this open until we have confirmation.

M00nF1sh pushed a commit to M00nF1sh/kubernetes that referenced this issue Jan 9, 2019
This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
M00nF1sh pushed a commit to M00nF1sh/kubernetes that referenced this issue Jan 9, 2019
This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
marshallbrekka pushed a commit to wearefair/kubernetes that referenced this issue Jan 11, 2019
This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
M00nF1sh pushed a commit to M00nF1sh/kubernetes that referenced this issue Jan 16, 2019
This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
M00nF1sh pushed a commit to M00nF1sh/kubernetes that referenced this issue Jan 16, 2019
This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
@tuapuikia

kops 1.11 supports k8s 1.11.x. I already tested it on 1.11.7 and it's working as expected.

@arielb135

Does this work on 1.10? Was it patched?
What I've experienced: it removes the rules and puts them back after ~1 minute, which means a full minute where the service is not active.

In our production cluster it didn't even put them back.

@jochenhebbrecht

Hi,

Can somebody confirm in which version(s) of K8S this issue is fixed? We are bumping into the same issue and we are on 1.11.6.

Thanks,
Jochen

@kellycampbell

kellycampbell commented Mar 14, 2019 via email

@jochenhebbrecht

@kellycampbell thanks! I just located this file: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md, and it indicates the fix should be in 1.11.6. Unfortunately, that's the version we are on and we're still facing the issue.

From what I can read in the logs, it seems that the leader kube-controller-manager pod is stuck. It stopped updating the security group rules.

@kellycampbell

kellycampbell commented Mar 14, 2019 via email

@jochenhebbrecht

Yes, sorry, that title confused me.
OK, we'll try to upgrade our K8S cluster and verify whether the issue pops up again.

rjaini added a commit to msazurestackworkloads/kubernetes that referenced this issue Apr 14, 2019
* Fix AWS NLB security group updates

This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
rjaini added a commit to msazurestackworkloads/kubernetes that referenced this issue May 6, 2019
* Fix AWS NLB security group updates

This corrects a problem where valid security group ports were removed
unintentionally when updating a service or when node changes occur.

Fixes kubernetes#60825, kubernetes#64148
@jaybe78

jaybe78 commented May 18, 2019

I also have a Kubernetes cluster in a VPC and I still get the same issue. I'm on Kubernetes 1.11.9.
My 2 worker nodes do not pass the health check.
@jochenhebbrecht @kellycampbell @arielb135 Have you managed to make it work on your side?

@jochenhebbrecht

Hi @jaybe78 .

Yes, we managed to make it work by upgrading to 1.11.7. We're currently still on that version and we are no longer bumping into this issue.

Jochen

@M00nF1sh
Contributor

@jaybe78 I believe the original issue is already fixed by #68422 (which was backported to older versions).

Would you share your Service spec and worker node security group settings on AWS with me (@M00nF1sh in the k8s Slack channel)? I can help take a look.

@k8s-ci-robot k8s-ci-robot added area/provider/aws Issues or PRs related to aws provider needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. and removed sig/aws labels Aug 6, 2019
@dhanvi

dhanvi commented Aug 27, 2019

/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 27, 2019
@zegl
Contributor Author

zegl commented Aug 27, 2019

I'll close this. The bug has been fixed (and released) on v1.11, v1.12, v1.13, and v1.14.

@zegl zegl closed this as completed Aug 27, 2019
@1hanymhajna

1hanymhajna commented Jan 20, 2022

Hello,
We just saw the same behavior on v1.18.
During node replacement (scaling up new nodes and cordoning the old ones) we noticed that the security group was updated automatically and the relevant ELB rule was removed.
We are using EKS 1.18, our load balancer type is ELB, and the nodes are managed by us.

We can see the event triggered by the workers in CloudTrail, which confirms it happened automatically from the controller side (not a human mistake).
It just happened suddenly; we tried the same flow in other clusters and it didn't happen there.
I think the fix has a specific condition that triggers the bug again, or something like that.

We saw it in a dev cluster as well: when we lost some spot nodes it also happened, but again, it doesn't happen every time, so we still don't have a specific flow for reproducing it.

@zegl

@1hanymhajna

Does the fix only handle the NLB type? Shouldn't it be made global (ELB, ALB)?
