AWS Security Group rules are removed when adding/removing worker nodes #64148
/sig AWS
I enabled verbose logging and reproduced the problem:
A re-election causes the next kube-controller-manager to re-add the rules:
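For anyone trying to reproduce this, the verbose controller-manager logs referenced here come from raising klog verbosity. On a kubeadm-provisioned master that is an edit to the static pod manifest (an illustrative excerpt; the path and the `--cloud-provider=aws` flag are assumptions about the setup, `--v` is the standard verbosity flag):

```yaml
# Excerpt of a kubeadm-style static pod manifest for the controller
# manager; the typical path is /etc/kubernetes/manifests/kube-controller-manager.yaml
# (the kubelet restarts the pod automatically when the file changes).
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        - --cloud-provider=aws
        - --v=4   # raise klog verbosity to capture the security group reconcile logs
```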
Honest question: We have one "k8s-managed-lb" security group shared between all worker nodes. Should we have one managed security group rule per instance instead? I don't know how that would work together with auto scaling groups, but it's worth asking.
/cc @micahhausler
I'm also testing out Kubernetes 1.10.1 with NLBs for ingress, and I'm seeing exactly the same problem.
@FrederikNS Thanks for letting me know. I guess we'll have to use ELBs instead until this issue has been resolved.
It seems that this might have something to do with using "private subnets". I have 2 clusters running the same version, with the same NLB setup. The only difference between the two clusters is that one has all the worker nodes in public subnets and the other has all nodes in private subnets. Only the private subnet cluster is experiencing this problem. I can see from the logs that, right before the ports are removed from the security group, the log outputs
Yea, this looks like a bug. Thanks for opening this.
related: #60825
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
I think this is fixed by #68422. I tested by replacing a node and watching the security group rules in AWS before/after.
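The failure mode in this thread is a classic reconcile pitfall: if the controller computes its desired rule set from a stale or partial view of the nodes and listeners, a naive "revoke everything not desired" pass wipes valid rules. Below is a minimal sketch of the defensive idea behind a fix like this (illustrative only, not the actual cloud-provider code; the rule dicts and the `k8s-managed` tag are made up for the example): only revoke rules the controller actually owns, and never mass-revoke on an empty desired set.

```python
def reconcile_rules(current, desired, managed_tag="k8s-managed"):
    """Return (to_add, to_revoke) for a security group.

    Defensive reconciliation sketch: only rules carrying our
    management tag are ever revoked, and an empty desired set
    revokes nothing (fail safe instead of wiping the group,
    which is the behavior this issue describes).
    """
    to_add = [r for r in desired if r not in current]
    if not desired:
        return to_add, []  # fail safe: never mass-revoke
    to_revoke = [
        r for r in current
        if r not in desired and managed_tag in r.get("tags", ())
    ]
    return to_add, to_revoke

# Rules are plain dicts here: port + CIDR + tags (all hypothetical).
current = [
    {"port": 30080, "cidr": "10.0.0.0/16", "tags": ("k8s-managed",)},
    {"port": 22, "cidr": "0.0.0.0/0", "tags": ()},  # human-managed, must survive
]
desired = [
    {"port": 30443, "cidr": "10.0.0.0/16", "tags": ("k8s-managed",)},
]
add, revoke = reconcile_rules(current, desired)
```

With this shape, the human-managed SSH rule is never touched, and a controller that momentarily sees zero nodes produces an empty revoke list instead of stripping the group.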
This corrects a problem where valid security group ports were removed unintentionally when updating a service or when node changes occur. Fixes kubernetes#60825, kubernetes#64148
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
I have not been able to test this, as kops doesn't support 1.13 yet. And I have not heard anyone report it fixed, so I'd like to keep this open until we have confirmation.
kops 1.11 supports k8s 1.11.x. I already tested it on 1.11.7 and it is working as expected.
Does this work on 1.10? Was it patched? In our production cluster it didn't even put the rules back.
Hi, can somebody confirm in which version(s) of K8S this issue is fixed? We are bumping into the same issue and we are on 1.11.6. Thanks, Jochen
Please check the changelog files posted on git.
@kellycampbell thanks! Just located this file: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md and I've noticed it should be fixed in 1.11.6. Unfortunately, that's the version we are on and we're still facing the issue. From what I can read from the logs, it seems that the leader kube-controller-manager pod is stuck. It stopped updating the security group rules.
It's in 1.11.7. Maybe you were confused by the changelog title "Changelog since v1.11.6"?
Yes, sorry, that title confused me.
I also have a Kubernetes cluster in a VPC and I still get the same issue. I'm on Kubernetes 1.11.9.
Hi @jaybe78. Yes, we managed to make it work by upgrading to 1.11.7. We're currently still on that version and we are no longer bumping into this issue. Jochen
/sig cloud-provider
I'll close this. The bug has been fixed (and released) in v1.11, v1.12, v1.13, and v1.14.
Hello. We can see the event triggered by the workers (in CloudTrail), which proves it happened automatically from the controller side (not a human mistake). We saw it as well in a dev cluster: when we lost some spot nodes it also happened, but again it does not happen every time, so we still don't have a specific flow for how to reproduce it.
The fix only handles the NLB type; doesn't it need to be made general (ELB, ALB)?
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
When scaling up or down the number of worker nodes, all rules in the security group managed by the controller-manager are removed.
Here are the logs from the controller-manager when the problem started; they don't contain any errors or warnings.
I don't know where the number "15" comes from; there's only one LoadBalancer (3 ports, 4 workers, 3 AZs).
Here are the logs from the controller-manager when another node has taken over the leadership:
What you expected to happen:
Security group rules should not be removed when the LoadBalancer has not changed.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
I've only tested this with NLBs; ELBs might not be affected.
The Service has the following annotations:
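The annotations themselves were elided above. For context, a typical NLB-backed Service on this Kubernetes version would look roughly like the sketch below (illustrative only, not the reporter's actual manifest; the `aws-load-balancer-type: "nlb"` annotation is the documented one, while the name, selector, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-ingress            # placeholder name
  annotations:
    # Documented annotation that makes the AWS cloud provider
    # provision a Network Load Balancer instead of a classic ELB.
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: my-ingress           # placeholder selector
  ports:
    - name: http
      port: 80
      targetPort: 8080
```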
Restarting the leader controller-manager solves the problem. The new leader will re-add the missing security group rules.
We're running multiple Kubernetes Clusters on the same AWS account.
Environment:
Kubernetes version (use kubectl version): v1.10.2