implement InstanceShutdownByProviderID to aws cloudprovider #59930

zetaab · 2018-02-15T18:01:03Z

What this PR does / why we need it: implement InstanceShutdownByProviderID to aws cloudprovider

Which issue(s) this PR fixes:
Fixes #59925

Special notes for your reviewer:

Release note:

NONE

jhorwit2

@zetaab There is a race condition here when the instance is stopping. That'd cause it to be deleted under the current logic since InstanceExistsByProviderID only checks if the instance isn't running.

jhorwit2 · 2018-02-16T04:34:35Z

/sig aws
/area cloudprovider

zetaab · 2018-02-16T05:40:47Z

@jhorwit2 you are right, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html Do you have suggestions for how I should solve this? I could add the "stopping" state to this shutdown function as well. In theory that is not correct, but what do you think? The node might also be deleted if the instance is in the pending state after it is started again. Could we add stopping and pending to this shutdown function? For a new node, the pending state has no effect on the current working solution, because the node has not been added to the cluster yet.

zetaab · 2018-02-16T06:25:23Z

Hmm, actually I have a better idea:

Currently there are two ways nodes are deleted: by the nodelifecycle controller or by the cloud controller.

Nodelifecycle controller: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1291-L1306 This does not care about instance state at all? So when using the nodelifecycle controller, the instance is currently not deleted in AWS. If we use this cloudprovider shutdown function as-is, the node taint is added correctly, ONLY when it is safe to detach volumes.

Cloud controller manager:
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1308-L1338 IMO here we should check that the instance state is neither shutting-down nor terminated. Then it should work correctly: the shutdown taint is added only when it is safe to detach the volume immediately (state stopped). Otherwise the node is simply not deleted from the cluster (if the machine is not going to be terminated).

What do you think, can we modify InstanceExistsByProviderID to work a little differently in order to keep nodes in the cluster? The current solution is not perfect, because if an instance stays in the rebooting state for a long time, the node is deleted from the cluster.

Which solution is better? 1) Add the pending and stopping states to shutdown. 2) Modify InstanceExistsByProviderID (we would then have to take care to modify this in all cloudproviders). The spec would change from "InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is running" to "InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is not going to be removed".
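
To make option 2 concrete, here is a minimal Go sketch of how the two checks would classify each EC2 state under the reworded spec. The helper names `existsUnderOption2` and `shutdownUnderOption2` and the locally defined state constants are assumptions made for illustration; they are not the functions in aws.go. The state names come from the EC2 lifecycle page linked above.

```go
package main

import "fmt"

// EC2 instance state names, defined locally so the sketch needs no SDK import.
const (
	statePending      = "pending"
	stateRunning      = "running"
	stateShuttingDown = "shutting-down"
	stateTerminated   = "terminated"
	stateStopping     = "stopping"
	stateStopped      = "stopped"
)

// existsUnderOption2 (hypothetical) models the reworded spec: the instance
// "exists" unless it is being removed or is already gone.
func existsUnderOption2(state string) bool {
	return state != stateShuttingDown && state != stateTerminated
}

// shutdownUnderOption2 (hypothetical) reports shutdown only for "stopped",
// the state in which detaching volumes immediately is described as safe.
func shutdownUnderOption2(state string) bool {
	return state == stateStopped
}

func main() {
	states := []string{
		statePending, stateRunning, stateStopping,
		stateStopped, stateShuttingDown, stateTerminated,
	}
	for _, s := range states {
		fmt.Printf("%s: exists=%t shutdown=%t\n",
			s, existsUnderOption2(s), shutdownUnderOption2(s))
	}
}
```

Under this reading, transitional states such as pending and stopping neither delete the node nor add the shutdown taint, which is the behaviour the comment above argues for.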

justinsb · 2018-02-16T13:41:47Z

pkg/cloudprovider/providers/aws/aws.go

+		return false, err
+	}
+	if len(instances) == 0 {
+		return false, nil


Does this imply that we took so long that AWS removed the instance from the list? I agree it is unexpected, but maybe glog.Warning and return true, nil

yastij · 2018-02-17

+1, we should return true here with a warning.

zetaab · 2018-02-17

If we return true here, the instance is not removed from the cluster; instead the shutdown taint is added. But the instance really has been deleted from AWS, so IMO the correct choice is to return false here. By returning false, the code continues its checks and finally removes the node from the cluster.

justinsb · 2018-02-16T13:43:07Z

pkg/cloudprovider/providers/aws/aws.go

+
+	state := instances[0].State.Name
+	// valid state for detaching volumes is stopped 
+	if *state == "stopped" {


I think also terminated.

Might also be better to use the constants in the ec2 sdk: InstanceStateNameStopped and InstanceStateNameTerminated

zetaab · 2018-02-18

@justinsb can you read my longer comment above? The thing here is that with the nodelifecycle controller this is going to work pretty well: nodes are not deleted from the cluster if they are in some weird state like pending. However, in the future with the cloud controller manager this is not going to work. We have two interfaces where we should check all possible instance states; otherwise the nodes are deleted from the cluster. Currently we check only stopped, terminated (? I will add it) and active. We should add the shutting-down, stopping, pending and rebooting instance states to InstanceShutdownByProviderID or InstanceExistsByProviderID. IMO exists is the correct place for these.

justinsb · 2018-02-16T13:44:06Z

pkg/cloudprovider/providers/aws/aws.go

+		return false, fmt.Errorf("multiple instances found for instance: %s", instanceID)
+	}
+
+	state := instances[0].State.Name


Maybe aws.StringValue(instances[0].State.Name) to avoid a panic if name is not set (though we'll still panic if state is nil, but ...)

yastij · 2018-02-17

I'd prefer checking state before doing this
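
The two suggestions in this thread (a nil-safe read of the name, and checking the state first) can be combined. The following is a rough sketch only: `instance`, `instanceState`, `stringValue` and `isShutdown` are hypothetical stand-ins written for this example so that it compiles without the AWS SDK. `stringValue` mimics the behaviour of `aws.StringValue`, which returns an empty string for a nil pointer. The empty-result and multiple-result branches follow the diff hunks quoted earlier in the review.

```go
package main

import (
	"errors"
	"fmt"
)

// instanceState and instance are hypothetical, reduced stand-ins for the
// SDK types; only the fields used in this sketch are present.
type instanceState struct {
	Name *string
}

type instance struct {
	State *instanceState
}

// stringValue returns "" for a nil pointer, like aws.StringValue does.
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// isShutdown checks State for nil before reading Name, so neither a missing
// state nor a missing name can cause a panic.
func isShutdown(instances []*instance) (bool, error) {
	if len(instances) == 0 {
		return false, nil // nothing found: not reported as shut down
	}
	if len(instances) > 1 {
		return false, errors.New("multiple instances found")
	}
	inst := instances[0]
	if inst == nil || inst.State == nil {
		return false, nil
	}
	name := stringValue(inst.State.Name)
	return name == "stopped" || name == "terminated", nil
}

func main() {
	stopped := "stopped"
	ok, err := isShutdown([]*instance{{State: &instanceState{Name: &stopped}}})
	fmt.Println(ok, err)

	// An instance with no State set is handled without panicking.
	ok, err = isShutdown([]*instance{{}})
	fmt.Println(ok, err)
}
```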

justinsb · 2018-02-16T13:49:23Z

I'm not very familiar with these controllers - it seems like they are new. I'm digging into what they do, but some comments on the code in the meantime.

justinsb · 2018-02-16T13:50:30Z

Also, why are you stopping your instances on AWS, vs just terminating them?

justinsb · 2018-02-16T14:18:51Z

Ah - this is brand new. Let's review alongside the reapplication of #59323. I put some comments on the API here: #59323 (comment)

/ok-to-test

sjenning · 2018-02-16T17:12:35Z

fyi @eparis

fejta-bot · 2018-07-25T23:01:58Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2018-08-24T23:48:40Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

zetaab · 2018-08-29T10:59:16Z

/remove-lifecycle rotten

zetaab · 2018-08-29T13:51:48Z

/milestone v1.12

k8s-ci-robot · 2018-08-29T13:51:49Z

@zetaab: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

zetaab · 2018-08-29T13:54:00Z

@justinsb can you review this and add milestone 1.12? This is needed for #67977

yastij · 2018-08-29T17:08:38Z

/retest

dims · 2018-08-29T19:27:28Z

@kubernetes/sig-aws-misc

gnufied · 2018-08-29T21:31:50Z

pkg/cloudprovider/providers/aws/aws.go

+	if instance.State != nil {
+		state := aws.StringValue(instance.State.Name)
+		// valid state for detaching volumes
+		if state == "stopped" || state == "terminated" {


Can we use InstanceStateNameStopped and InstanceStateNameTerminated constants instead?
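
As an illustration of the requested change, the condition from the hunk above could be written against named constants instead of string literals. In this sketch the two constants are redeclared locally under the names mentioned in the review so the example is self-contained; in aws.go they would come from the SDK's ec2 package. The function name `safeToDetach` is hypothetical.

```go
package main

import "fmt"

// Local copies of the two constant names from the review comment. Assumption:
// in the real code these would be referenced from the SDK, not redeclared.
const (
	InstanceStateNameStopped    = "stopped"
	InstanceStateNameTerminated = "terminated"
)

// safeToDetach (hypothetical name) is the same check as in the diff, with
// the literals replaced by constants.
func safeToDetach(state string) bool {
	return state == InstanceStateNameStopped || state == InstanceStateNameTerminated
}

func main() {
	fmt.Println(safeToDetach("stopped"), safeToDetach("running"))
}
```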

gnufied · 2018-08-29T21:32:11Z

mostly looks good to me. one minor change and we are good to go.

gnufied · 2018-08-29T21:32:42Z

@saad-ali @childsb can we add this to 1.12 milestone please?

gnufied · 2018-08-30T16:07:16Z

/lgtm

k8s-ci-robot · 2018-08-30T16:07:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/cloudprovider/providers/aws/OWNERS~~ [gnufied]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-github-robot · 2018-08-30T19:28:45Z

/test all [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot · 2018-08-30T19:40:30Z

Automatic merge from submit-queue (batch tested with PRs 67368, 59930, 68074). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

k8s-ci-robot · 2018-08-30T21:28:24Z

@zetaab: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kops-aws | `a68cbd6` | link | `/test pull-kubernetes-e2e-kops-aws` |
| pull-kubernetes-kubemark-e2e-gce-big | `a68cbd6` | link | `/test pull-kubernetes-kubemark-e2e-gce-big` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot requested review from gnufied and justinsb February 15, 2018 18:01

jhorwit2 suggested changes Feb 16, 2018

k8s-ci-robot added sig/aws area/cloudprovider labels Feb 16, 2018

justinsb reviewed Feb 16, 2018

k8s-ci-robot removed the needs-ok-to-test label Feb 16, 2018

k8s-ci-robot added the needs-rebase label Feb 18, 2018

k8s-ci-robot removed the needs-rebase label Apr 26, 2018

k8s-ci-robot added the lifecycle/stale label Jul 25, 2018

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Aug 24, 2018

k8s-ci-robot removed the lifecycle/rotten label Aug 29, 2018

zetaab force-pushed the shutdoaw branch from 0c04b0b to 1494194 August 29, 2018 11:00

k8s-ci-robot added the sig/cloud-provider label Aug 29, 2018

zetaab force-pushed the shutdoaw branch from 1494194 to ea2340f August 29, 2018 11:08

k8s-ci-robot added size/M and removed size/S labels Aug 29, 2018

zetaab mentioned this pull request Aug 29, 2018

add detach logic for node shutdown taint #67977

Closed

k8s-ci-robot added release-note-none and removed release-note labels Aug 29, 2018

gnufied reviewed Aug 29, 2018

implement InstanceShutdownByProviderID to aws cloudprovider

a68cbd6

changes according what was asked use string do not delete instance if it is in any other state than running use constants fix

zetaab force-pushed the shutdoaw branch from 9c6bea3 to a68cbd6 August 30, 2018 05:51

jsafrane added this to the v1.12 milestone Aug 30, 2018

zetaab mentioned this pull request Aug 30, 2018

If kubelet is unavailable, AttachDetachController fails to force detach on pod deletion #65392

Closed

k8s-ci-robot assigned gnufied Aug 30, 2018

k8s-ci-robot added the lgtm label Aug 30, 2018

k8s-ci-robot added the approved label Aug 30, 2018

k8s-github-robot merged commit cfdefff into kubernetes:master Aug 30, 2018

zetaab deleted the shutdoaw branch August 31, 2018 06:34

sjenning mentioned this pull request Aug 31, 2018

cloudprovider: aws: return true on existence check for stopped instances #66835

Merged


justinsb Feb 16, 2018

yastij Feb 17, 2018

zetaab Feb 17, 2018

justinsb Feb 16, 2018

zetaab Feb 18, 2018

justinsb Feb 16, 2018

yastij Feb 17, 2018
