cloudprovider: aws: return true on existence check for stopped instances #66835
/release-note-none |
To be clear, what is the behavior of the other cloud providers (vmware, GCP, etc)? Are they like AWS or like OpenStack now? |
#51409 implemented The following DO NOT filter on state: The following DO filter on state: #59931 doesn't actually get the job done for Openstack due to https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/openstack/openstack_instances.go#L125-L128 |
Actually looking at this again, the node controller calls Removal of the Openstack filtering was done in the external cloudprovider code in kubernetes/cloud-provider-openstack#43. So #59931 disables filtering for the built-in Openstack provider and kubernetes/cloud-provider-openstack#43 removes it for the external Openstack cloud provider. I need to adjust |
763ed14 to 34490a5
The external AWS cloud provider simply vendors its code from kube atm, so the change to |
Also note that vsphere DOES filter with the built-in but not with the external. What a crazy place this is... |
I agree w/ this change, and would say that the usage of taints and labels continues to grow, and indirect loss of such attributes on a node has a number of unintended consequences for operators. I think the current UX provided by the AWS cloud provider is confusing end users and should be changed. See my prior comment: |
/sig cloud-provider |
/test pull-kubernetes-e2e-kops-aws |
Also see the discussion at #46442 (comment). This was brought up in sig-aws last week, and is on the agenda for sig-cloud-provider next week. The summary of where the various cloud providers are at currently is a good data point. |
I also agree with this change. So, do the other cloud providers ever return terminated instances in the results? With this change, does AWS need to filter out recently terminated instances? |
based on https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html, a 'pending' state can occur while starting up a stopped instance, so it should still return true. I could see returning false for instances in terminated state. /cc @d-nishi |
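The state handling suggested in this comment can be sketched roughly as below. This is only an illustration of the proposal, not the code in the PR: `instanceExistsForState` is a hypothetical helper, the state names are the six listed in the EC2 lifecycle documentation, and treating unrecognized states as existing is an assumption made here for simplicity.

```go
package main

import "fmt"

// instanceExistsForState is a hypothetical helper (not part of the AWS cloud
// provider). It reports whether an instance in the given EC2 state should be
// treated as "existing" under the proposal in the comment above: every state
// counts as existing except "terminated".
// Assumption: unrecognized state names are also treated as existing.
func instanceExistsForState(state string) bool {
	return state != "terminated"
}

func main() {
	states := []string{
		"pending", "running", "shutting-down",
		"stopping", "stopped", "terminated",
	}
	for _, s := range states {
		fmt.Printf("%-13s exists=%v\n", s, instanceExistsForState(s))
	}
}
```

The point of keeping "pending" in the existing set is that a stopped instance passes through "pending" while it starts up again, so reporting it as missing would delete the Node in the middle of a restart.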
@liggitt: GitHub didn't allow me to request PR reviews from the following users: d-nishi. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
per #67254 this pr seems like the appropriate follow-on. |
@jsafrane can you review and approve? see the reference to prereq approved pr on interface documentation change. |
@sjenning AWS and Openstack are somewhat unique in this regard IIRC. Only these cloud providers remove the node when a node is shut down. In general, writing disruptive tests that simulate this behaviour and making sure nothing breaks has been harder because kops currently deploys everything in an ASG, and a shutdown node is automatically terminated and removed from the ASG. I am in general okay with this change, but we must keep in mind that there are almost no e2e tests that capture this change. |
Hmm looks like behavior of Openstack was fixed sometime back but I do remember it removing a shutdown/stopped node originally. GCE also keeps shutdown nodes in the list and does not remove them. GCE fortunately has many disruptive tests, AWS has zero. |
Storage has https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/pd.go#L307 and there are probably more. A while back we wrote some disruptive tests for AWS: gnufied@3df33c7, specifically gnufied@3df33c7#diff-cf6d6091f38a1751fd35b8257efc60a3. But there is no way to run them in the current test-infra setup, so we ended up deleting those. |
34490a5 to 75df7ff
I have updated the commit to exclude instances in |
/retest |
The regression we noticed today, where volumes were not being detached from AWS, is possibly caused by #65392. Before this patch, a stopped node caused deletion of all pods on it; now the pods on stopped nodes stick around in "unknown" but "Running" phase. |
@@ -4325,7 +4325,8 @@ func (c *Cloud) findInstanceByNodeName(nodeName types.NodeName) (*ec2.Instance,
 	privateDNSName := mapNodeNameToPrivateDNSName(nodeName)
 	filters := []*ec2.Filter{
 		newEc2Filter("private-dns-name", privateDNSName),
-		newEc2Filter("instance-state-name", "running"),
+		// exclude instances in "terminated" state
+		newEc2Filter("instance-state-name", "pending", "running", "shutting-down", "stopping", "stopped"),
minor nit - can we make a constant out of the states we consider "alive" ?
75df7ff to 85437ef
@sjenning filtering out terminated instances is fine. However, you could use constants (those are defined in aws ec2 lib) |
85437ef to bbd643f
@zetaab good call. updated. |
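A rough sketch of what the review suggestions amount to (named constants and a single list of "alive" states) is shown below. It is not the updated commit. The constants here are local stand-ins so the example compiles on its own; the aws-sdk-go `ec2` package exports equivalents such as `ec2.InstanceStateNamePending` and `ec2.InstanceStateNameStopped`. The `filter` type, `newFilter`, `aliveStates`, and `aliveInstanceFilters` are hypothetical names standing in for `*ec2.Filter` and `newEc2Filter`.

```go
package main

import "fmt"

// Local stand-ins for the instance state name constants that the
// aws-sdk-go ec2 package exports (ec2.InstanceStateNamePending, etc.).
const (
	stateNamePending      = "pending"
	stateNameRunning      = "running"
	stateNameShuttingDown = "shutting-down"
	stateNameStopping     = "stopping"
	stateNameStopped      = "stopped"
)

// aliveStates lists every state in which the instance is still considered
// to exist. "terminated" is deliberately left out.
var aliveStates = []string{
	stateNamePending,
	stateNameRunning,
	stateNameShuttingDown,
	stateNameStopping,
	stateNameStopped,
}

// filter is a simplified, hypothetical stand-in for *ec2.Filter.
type filter struct {
	Name   string
	Values []string
}

// newFilter mirrors the shape of the newEc2Filter helper seen in the diff:
// a filter name followed by any number of accepted values.
func newFilter(name string, values ...string) filter {
	return filter{Name: name, Values: values}
}

// aliveInstanceFilters builds the filters used to look up an instance by
// private DNS name while excluding terminated instances.
func aliveInstanceFilters(privateDNSName string) []filter {
	return []filter{
		newFilter("private-dns-name", privateDNSName),
		newFilter("instance-state-name", aliveStates...),
	}
}

func main() {
	// The DNS name is a made-up example value.
	for _, f := range aliveInstanceFilters("ip-10-0-0-1.ec2.internal") {
		fmt.Printf("%s=%v\n", f.Name, f.Values)
	}
}
```

Keeping the list in one place means a later decision about "pending" or "shutting-down" only has to be changed once.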
/approve
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied, sjenning, zetaab The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
cc @childsb @derekwaynecarr please add it to 1.12 milestone. |
impacts sig node. /sig node |
/retest Review the full test history for this PR. Silence the bot with an |
Automatic merge from submit-queue (batch tested with PRs 67571, 67284, 66835, 68096, 68152). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md. |
xref https://bugzilla.redhat.com/show_bug.cgi?id=1559271
xref openshift/origin#19899
background #45986 (comment)
Basically our customers are hitting this issue where the Node resource is deleted when the AWS instances stop (not terminate). If the instances restart, the Nodes lose any labeling/taints.
Openstack cloudprovider already made this change #59931
fixes #45118 for AWS
Reviewer note: valid AWS instance states are pending | running | shutting-down | terminated | stopping | stopped. There might be a case for returning false for instances in pending and/or terminated state. Discuss!
InstanceID() changes from #45986 credit @rrati
@derekwaynecarr @smarterclayton @liggitt @justinsb @jsafrane @countspongebob
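The label/taint loss described above can be illustrated with a toy model. This is not Kubernetes code: `cluster`, `node`, `syncNode`, and `simulateStopStart` are hypothetical names, and the model folds node deletion by the controller and re-registration by the kubelet into a single function for brevity.

```go
package main

import "fmt"

// node is a toy stand-in for a Node resource; only labels are modeled.
type node struct {
	Labels map[string]string
}

// cluster is a toy stand-in for the API server's set of Node resources.
type cluster struct {
	nodes map[string]*node
}

// syncNode models one reconciliation pass. If the cloud provider's existence
// check reports the instance as gone, the Node resource is deleted. If the
// instance exists but has no Node resource, a fresh, unlabeled one is created
// (standing in for the kubelet re-registering itself).
func (c *cluster) syncNode(name string, existsInCloud bool) {
	if !existsInCloud {
		delete(c.nodes, name)
		return
	}
	if _, ok := c.nodes[name]; !ok {
		c.nodes[name] = &node{Labels: map[string]string{}}
	}
}

// simulateStopStart labels a node, stops and restarts its instance, and
// returns the labels that survive. existsWhileStopped is the answer the
// existence check gives for a stopped instance.
func simulateStopStart(existsWhileStopped bool) map[string]string {
	c := &cluster{nodes: map[string]*node{}}
	c.syncNode("node-a", true)
	c.nodes["node-a"].Labels["role"] = "infra" // operator adds a label

	c.syncNode("node-a", existsWhileStopped) // instance is stopped
	c.syncNode("node-a", true)               // instance is started again
	return c.nodes["node-a"].Labels
}

func main() {
	fmt.Println("stopped reported as missing: ", simulateStopStart(false))
	fmt.Println("stopped reported as existing:", simulateStopStart(true))
}
```

In this model the label survives only when the existence check keeps returning true for the stopped instance, which is the behavior this PR introduces for AWS.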