Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloudprovider: aws: return true on existence check for stopped instances #66835

Merged
merged 1 commit into from
Sep 1, 2018

Conversation

sjenning
Copy link
Contributor

@sjenning sjenning commented Jul 31, 2018

xref https://bugzilla.redhat.com/show_bug.cgi?id=1559271
xref openshift/origin#19899

background #45986 (comment)

Basically our customers are hitting this issue where the Node resource is deleted when the AWS instances stop (not terminate). If the instances restart, the Nodes lose any labeling/taints.

Openstack cloudprovider already made this change #59931

fixes #45118 for AWS

Reviewer note: valid AWS instance states are pending | running | shutting-down | terminated | stopping | stopped. There might be a case for returning false for instances in pending and/or terminated state. Discuss!

InstanceID() changes from #45986 credit @rrati

@derekwaynecarr @smarterclayton @liggitt @justinsb @jsafrane @countspongebob

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 31, 2018
@sjenning
Copy link
Contributor Author

/release-note-none
/sig-aws

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 31, 2018
@smarterclayton
Copy link
Contributor

To be clear, what is the behavior of the other cloud providers (vmware, GCP, etc)? Are they like AWS or like OpenStack now?

@sjenning
Copy link
Contributor Author

sjenning commented Jul 31, 2018

#51409 implemented InstanceExistsByProviderID() for most of the cloudproviders. vshpere has since removed its vm.IsActive() check in #55845.

The following DO NOT filter on state:
Azure, CloudStack, GCP, vshpere, Openstack

The following DO filter on state:
Openstack (see update below), AWS

#59931 doesn't actually get the job done for Openstack due to https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/openstack/openstack_instances.go#L125-L128

@sjenning
Copy link
Contributor Author

sjenning commented Jul 31, 2018

Actually looking at this again, the node controller calls InstanceID() in ensureNodeExistsByProviderID() for built-in cloudprovider usage and falls back to InstanceExistsByProviderID() only in the case of an external cloudprovider (i.e. providerID == "")

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cloud/node_controller.go#L446-L467

Removal of the Openstack filtering was done in the external cloudprovider code in kubernetes/cloud-provider-openstack#43.

So #59931 disables filtering for the built-in Openstack provider and kubernetes/cloud-provider-openstack#43 removes it for the external Openstack cloud provider.

I need to adjust InstanceID() for AWS then as well.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 31, 2018
@sjenning
Copy link
Contributor Author

sjenning commented Jul 31, 2018

The external AWS cloud provider simply vendors its code from kube atm, so the change to InstanceExistsByProviderID() is needed in-tree so that when they bump their k8s dep, they will get the fix as well.

https://github.com/kubernetes/cloud-provider-aws/blob/master/cmd/aws-cloud-controller-manager/main.go

@sjenning
Copy link
Contributor Author

Also note that vsphere DOES filter with the built-in but not with the external. What a crazy place this is...

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/vsphere/vsphere.go#L755-L764

@derekwaynecarr
Copy link
Member

I agree w/ this change, and would say that the usage of taints and labels continues to grow and indirect loss of such attributes on a node have a number of unintended consequences for operators. I think the current ux provided by the AWS cloud provider is confusing end-users and should be changed.

See my prior comment:
#45986 (comment)

@neolit123
Copy link
Member

/sig cloud-provider
@kubernetes/sig-aws-misc

@k8s-ci-robot k8s-ci-robot added sig/aws sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Aug 1, 2018
@ingvagabund
Copy link
Contributor

/test pull-kubernetes-e2e-kops-aws

@liggitt
Copy link
Member

liggitt commented Aug 1, 2018

Also see the discussion at #46442 (comment)

This was brought up in sig-aws last week, and is on the agenda for sig-cloud-provider next week

The summary of where the various cloud providers are at currently is a good data point.

@nckturner
Copy link
Contributor

I also agree with this change. So, do the other cloud providers ever return terminated instances in the results? With this change, does AWS need to filter out recently terminated instances?

@liggitt
Copy link
Member

liggitt commented Aug 9, 2018

Reviewer note: valid AWS instance states are pending | running | shutting-down | terminated | stopping | stopped. There might be a case for returning false for instances in pending and/or terminated state. Discuss!

based on https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html, a 'pending' state can occur while starting up a stopped instance, so it should still return true.

I could see returning false for instances in terminated state.

/cc @d-nishi

@k8s-ci-robot
Copy link
Contributor

@liggitt: GitHub didn't allow me to request PR reviews from the following users: d-nishi.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Reviewer note: valid AWS instance states are pending | running | shutting-down | terminated | stopping | stopped. There might be a case for returning false for instances in pending and/or terminated state. Discuss!

based on https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html, a 'pending' state can occur while starting up a stopped instance, so it should still return true.

I could see returning false for instances in terminated state.

/cc @d-nishi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@derekwaynecarr
Copy link
Member

per #67254 this pr seems like the appropriate follow-on.

@derekwaynecarr
Copy link
Member

@jsafrane can you review and approve? see the reference to prereq approved pr on interface documentation change.

@gnufied
Copy link
Member

gnufied commented Aug 15, 2018

@sjenning AWS and Openstack are somewhat unique in this regard IIRC. Only these cloud providers remove the node when a node is shutdown.

In general writing disruptive tests that simulate this behaviour and making sure nothing breaks has been harder because kops currently deploys everything in a ASG and a shutdown node is automatically terminated and removed from ASG. I am in general okay with this change but we must keep in mind that, there are almost no e2e tests that capture this change.

@gnufied
Copy link
Member

gnufied commented Aug 15, 2018

Hmm looks like behavior of Openstack was fixed sometime back but I do remember it removing a shutdown/stopped node originally. GCE also keeps shutdown nodes in the list and does not remove them.

GCE fortunately has many disruptive tests, AWS has zero.

@gnufied
Copy link
Member

gnufied commented Aug 16, 2018

Storage has - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/pd.go#L307 and there are probably more.

a while back - we wrote some disruptive tests for AWS - gnufied@3df33c7 . Specifically gnufied@3df33c7#diff-cf6d6091f38a1751fd35b8257efc60a3 but there is no way to run them in current test-infra setup, so we ended up deleting those.

@sjenning
Copy link
Contributor Author

I have updated the commit to exclude instances in terminated state

@liggitt
Copy link
Member

liggitt commented Aug 17, 2018

/retest

@gnufied
Copy link
Member

gnufied commented Aug 17, 2018

The regression we noticed today where volumes were not being detached from AWS is possibly because of - #65392 . Before this patch, a stopped node caused deletion of all pods on it and now the pods on stopped nodes stick around in "unknown" but "Running" phase.

@childsb childsb mentioned this pull request Aug 24, 2018
@@ -4325,7 +4325,8 @@ func (c *Cloud) findInstanceByNodeName(nodeName types.NodeName) (*ec2.Instance,
privateDNSName := mapNodeNameToPrivateDNSName(nodeName)
filters := []*ec2.Filter{
newEc2Filter("private-dns-name", privateDNSName),
newEc2Filter("instance-state-name", "running"),
// exclude instances in "terminated" state
newEc2Filter("instance-state-name", "pending", "running", "shutting-down", "stopping", "stopped"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit - can we make a constant out of the states we consider "alive" ?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2018
@sjenning
Copy link
Contributor Author

Rebased and fixed nit.

Please note that #59930 merged about 24 hours ago and removes the state check from InstanceExistsByProviderID(). I'm adding it back to consider instances in terminated state as dead and filter them out.

fyi @zetaab

@zetaab
Copy link
Member

zetaab commented Aug 31, 2018

@sjenning filtering out terminated instances is fine. However, you could use constants (those are defined in aws ec2 lib)

@sjenning
Copy link
Contributor Author

@zetaab good call. updated.

Copy link
Member

@zetaab zetaab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@gnufied
Copy link
Member

gnufied commented Aug 31, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 31, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, sjenning, zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 31, 2018
@gnufied
Copy link
Member

gnufied commented Aug 31, 2018

cc @childsb @derekwaynecarr please add it to 1.12 milestone.

@derekwaynecarr
Copy link
Member

impacts sig node.

/sig node
/milestone v1.12
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 31, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Aug 31, 2018
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 31, 2018
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 67571, 67284, 66835, 68096, 68152). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit ba78154 into kubernetes:master Sep 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cloud providers get instance/vm calls filter on running status