cloudprovider: aws: return true on existence check for stopped instances #66835

sjenning · 2018-07-31T18:50:22Z

xref https://bugzilla.redhat.com/show_bug.cgi?id=1559271
xref openshift/origin#19899

Basically our customers are hitting this issue where the Node resource is deleted when the AWS instances stop (not terminate). If the instances restart, the Nodes lose any labeling/taints.

Openstack cloudprovider already made this change #59931

fixes #45118 for AWS

InstanceID() changes from #45986 credit @rrati

@derekwaynecarr @smarterclayton @liggitt @justinsb @jsafrane @countspongebob

sjenning · 2018-07-31T18:51:40Z

/release-note-none
/sig-aws

smarterclayton · 2018-07-31T19:35:12Z

To be clear, what is the behavior of the other cloud providers (vmware, GCP, etc)? Are they like AWS or like OpenStack now?

sjenning · 2018-07-31T20:35:59Z

#51409 implemented InstanceExistsByProviderID() for most of the cloudproviders. vshpere has since removed its vm.IsActive() check in #55845.

The following DO NOT filter on state:
Azure, CloudStack, GCP, vshpere, Openstack

The following DO filter on state:
~~Openstack~~ (see update below), AWS

#59931 doesn't actually get the job done for Openstack due to https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/openstack/openstack_instances.go#L125-L128

sjenning · 2018-07-31T21:21:03Z

Actually looking at this again, the node controller calls InstanceID() in ensureNodeExistsByProviderID() for built-in cloudprovider usage and falls back to InstanceExistsByProviderID() only in the case of an external cloudprovider (i.e. providerID == "")

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cloud/node_controller.go#L446-L467

Removal of the Openstack filtering was done in the external cloudprovider code in kubernetes/cloud-provider-openstack#43.

So #59931 disables filtering for the built-in Openstack provider and kubernetes/cloud-provider-openstack#43 removes it for the external Openstack cloud provider.

I need to adjust InstanceID() for AWS then as well.

sjenning · 2018-07-31T21:38:58Z

The external AWS cloud provider simply vendors its code from kube atm, so the change to InstanceExistsByProviderID() is needed in-tree so that when they bump their k8s dep, they will get the fix as well.

https://github.com/kubernetes/cloud-provider-aws/blob/master/cmd/aws-cloud-controller-manager/main.go

sjenning · 2018-07-31T21:47:57Z

Also note that vsphere DOES filter with the built-in but not with the external. What a crazy place this is...

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/vsphere/vsphere.go#L755-L764

derekwaynecarr · 2018-07-31T22:08:48Z

I agree w/ this change, and would say that the usage of taints and labels continues to grow and indirect loss of such attributes on a node have a number of unintended consequences for operators. I think the current ux provided by the AWS cloud provider is confusing end-users and should be changed.

See my prior comment:
#45986 (comment)

neolit123 · 2018-08-01T00:56:11Z

/sig cloud-provider
@kubernetes/sig-aws-misc

ingvagabund · 2018-08-01T07:27:37Z

/test pull-kubernetes-e2e-kops-aws

liggitt · 2018-08-01T13:36:50Z

Also see the discussion at #46442 (comment)

This was brought up in sig-aws last week, and is on the agenda for sig-cloud-provider next week

The summary of where the various cloud providers are at currently is a good data point.

nckturner · 2018-08-06T17:22:59Z

I also agree with this change. So, do the other cloud providers ever return terminated instances in the results? With this change, does AWS need to filter out recently terminated instances?

liggitt · 2018-08-09T19:59:18Z

Reviewer note: valid AWS instance states are pending | running | shutting-down | terminated | stopping | stopped. There might be a case for returning false for instances in pending and/or terminated state. Discuss!

based on https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html, a 'pending' state can occur while starting up a stopped instance, so it should still return true.

I could see returning false for instances in terminated state.

/cc @d-nishi

k8s-ci-robot · 2018-08-09T19:59:19Z

@liggitt: GitHub didn't allow me to request PR reviews from the following users: d-nishi.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Reviewer note: valid AWS instance states are pending | running | shutting-down | terminated | stopping | stopped. There might be a case for returning false for instances in pending and/or terminated state. Discuss!

based on https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html, a 'pending' state can occur while starting up a stopped instance, so it should still return true.

I could see returning false for instances in terminated state.

/cc @d-nishi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

derekwaynecarr · 2018-08-15T20:25:10Z

per #67254 this pr seems like the appropriate follow-on.

derekwaynecarr · 2018-08-15T20:28:34Z

@jsafrane can you review and approve? see the reference to prereq approved pr on interface documentation change.

gnufied · 2018-08-15T22:02:41Z

@sjenning AWS and Openstack are somewhat unique in this regard IIRC. Only these cloud providers remove the node when a node is shutdown.

In general writing disruptive tests that simulate this behaviour and making sure nothing breaks has been harder because kops currently deploys everything in a ASG and a shutdown node is automatically terminated and removed from ASG. I am in general okay with this change but we must keep in mind that, there are almost no e2e tests that capture this change.

gnufied · 2018-08-15T22:11:21Z

Hmm looks like behavior of Openstack was fixed sometime back but I do remember it removing a shutdown/stopped node originally. GCE also keeps shutdown nodes in the list and does not remove them.

GCE fortunately has many disruptive tests, AWS has zero.

gnufied · 2018-08-16T02:35:47Z

Storage has - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/pd.go#L307 and there are probably more.

a while back - we wrote some disruptive tests for AWS - gnufied@3df33c7 . Specifically gnufied@3df33c7#diff-cf6d6091f38a1751fd35b8257efc60a3 but there is no way to run them in current test-infra setup, so we ended up deleting those.

sjenning · 2018-08-16T20:30:07Z

I have updated the commit to exclude instances in terminated state

liggitt · 2018-08-17T01:17:35Z

/retest

gnufied · 2018-08-17T04:35:40Z

The regression we noticed today where volumes were not being detached from AWS is possibly because of - #65392 . Before this patch, a stopped node caused deletion of all pods on it and now the pods on stopped nodes stick around in "unknown" but "Running" phase.

gnufied · 2018-08-30T16:08:46Z

pkg/cloudprovider/providers/aws/aws.go

@@ -4325,7 +4325,8 @@ func (c *Cloud) findInstanceByNodeName(nodeName types.NodeName) (*ec2.Instance,
 	privateDNSName := mapNodeNameToPrivateDNSName(nodeName)
 	filters := []*ec2.Filter{
 		newEc2Filter("private-dns-name", privateDNSName),
-		newEc2Filter("instance-state-name", "running"),
+		// exclude instances in "terminated" state
+		newEc2Filter("instance-state-name", "pending", "running", "shutting-down", "stopping", "stopped"),


minor nit - can we make a constant out of the states we consider "alive" ?

sjenning · 2018-08-31T19:12:24Z

Rebased and fixed nit.

Please note that #59930 merged about 24 hours ago and removes the state check from InstanceExistsByProviderID(). I'm adding it back to consider instances in terminated state as dead and filter them out.

fyi @zetaab

zetaab · 2018-08-31T19:16:30Z

@sjenning filtering out terminated instances is fine. However, you could use constants (those are defined in aws ec2 lib)

sjenning · 2018-08-31T19:25:52Z

@zetaab good call. updated.

zetaab

/approve

gnufied · 2018-08-31T19:27:43Z

/lgtm

k8s-ci-robot · 2018-08-31T19:27:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, sjenning, zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/cloudprovider/providers/aws/OWNERS~~ [gnufied]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gnufied · 2018-08-31T19:27:59Z

cc @childsb @derekwaynecarr please add it to 1.12 milestone.

derekwaynecarr · 2018-08-31T19:30:59Z

impacts sig node.

/sig node
/milestone v1.12
/priority critical-urgent

fejta-bot · 2018-08-31T22:34:49Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

k8s-github-robot · 2018-09-01T03:41:38Z

Automatic merge from submit-queue (batch tested with PRs 67571, 67284, 66835, 68096, 68152). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

k8s-ci-robot requested review from justinsb and zmerlynn July 31, 2018 18:50

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 31, 2018

sjenning force-pushed the aws-exist-check branch from 763ed14 to 34490a5 Compare July 31, 2018 21:34

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 31, 2018

k8s-ci-robot added sig/aws sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Aug 1, 2018

sjenning mentioned this pull request Aug 15, 2018

UPSTREAM: 66835: cloudprovider: aws: return true on existence check for stopped instances openshift/origin#20663

Merged

sjenning force-pushed the aws-exist-check branch from 34490a5 to 75df7ff Compare August 16, 2018 20:26

childsb mentioned this pull request Aug 24, 2018

Handle force detach #67847

Closed

gnufied reviewed Aug 30, 2018

View reviewed changes

gnufied mentioned this pull request Aug 30, 2018

If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion #65392

Closed

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2018

sjenning force-pushed the aws-exist-check branch from 75df7ff to 85437ef Compare August 31, 2018 19:10

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2018

cloudprovider: aws: return true on existence check for stopped instances

bbd643f

sjenning force-pushed the aws-exist-check branch from 85437ef to bbd643f Compare August 31, 2018 19:25

zetaab approved these changes Aug 31, 2018

View reviewed changes

k8s-ci-robot assigned gnufied Aug 31, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 31, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 31, 2018

k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 31, 2018

k8s-ci-robot added this to the v1.12 milestone Aug 31, 2018

k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 31, 2018

k8s-github-robot merged commit ba78154 into kubernetes:master Sep 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cloudprovider: aws: return true on existence check for stopped instances #66835

cloudprovider: aws: return true on existence check for stopped instances #66835

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018

smarterclayton commented Jul 31, 2018

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018

derekwaynecarr commented Jul 31, 2018

neolit123 commented Aug 1, 2018

ingvagabund commented Aug 1, 2018

liggitt commented Aug 1, 2018

nckturner commented Aug 6, 2018

liggitt commented Aug 9, 2018

k8s-ci-robot commented Aug 9, 2018

derekwaynecarr commented Aug 15, 2018

derekwaynecarr commented Aug 15, 2018

gnufied commented Aug 15, 2018 •

edited

Loading

gnufied commented Aug 15, 2018 •

edited

Loading

gnufied commented Aug 16, 2018

sjenning commented Aug 16, 2018

liggitt commented Aug 17, 2018

gnufied commented Aug 17, 2018

gnufied Aug 30, 2018

sjenning commented Aug 31, 2018

zetaab commented Aug 31, 2018

sjenning commented Aug 31, 2018

zetaab left a comment

gnufied commented Aug 31, 2018

k8s-ci-robot commented Aug 31, 2018

gnufied commented Aug 31, 2018

derekwaynecarr commented Aug 31, 2018

fejta-bot commented Aug 31, 2018

k8s-github-robot commented Sep 1, 2018

cloudprovider: aws: return true on existence check for stopped instances #66835

cloudprovider: aws: return true on existence check for stopped instances #66835

Conversation

sjenning commented Jul 31, 2018 • edited Loading

sjenning commented Jul 31, 2018

smarterclayton commented Jul 31, 2018

sjenning commented Jul 31, 2018 • edited Loading

sjenning commented Jul 31, 2018 • edited Loading

sjenning commented Jul 31, 2018 • edited Loading

sjenning commented Jul 31, 2018

derekwaynecarr commented Jul 31, 2018

neolit123 commented Aug 1, 2018

ingvagabund commented Aug 1, 2018

liggitt commented Aug 1, 2018

nckturner commented Aug 6, 2018

liggitt commented Aug 9, 2018

k8s-ci-robot commented Aug 9, 2018

derekwaynecarr commented Aug 15, 2018

derekwaynecarr commented Aug 15, 2018

gnufied commented Aug 15, 2018 • edited Loading

gnufied commented Aug 15, 2018 • edited Loading

gnufied commented Aug 16, 2018

sjenning commented Aug 16, 2018

liggitt commented Aug 17, 2018

gnufied commented Aug 17, 2018

gnufied Aug 30, 2018

Choose a reason for hiding this comment

sjenning commented Aug 31, 2018

zetaab commented Aug 31, 2018

sjenning commented Aug 31, 2018

zetaab left a comment

Choose a reason for hiding this comment

gnufied commented Aug 31, 2018

k8s-ci-robot commented Aug 31, 2018

gnufied commented Aug 31, 2018

derekwaynecarr commented Aug 31, 2018

fejta-bot commented Aug 31, 2018

k8s-github-robot commented Sep 1, 2018

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018 •

edited

Loading

sjenning commented Jul 31, 2018 •

edited

Loading

gnufied commented Aug 15, 2018 •

edited

Loading

gnufied commented Aug 15, 2018 •

edited

Loading