Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement InstanceShutdownByProviderID to aws cloudprovider #59930

Merged
merged 1 commit into from
Aug 30, 2018

Conversation

zetaab
Copy link
Member

@zetaab zetaab commented Feb 15, 2018

What this PR does / why we need it: implement InstanceShutdownByProviderID to aws cloudprovider

Which issue(s) this PR fixes:
Fixes #59925

Special notes for your reviewer:

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 15, 2018
Copy link
Contributor

@jhorwit2 jhorwit2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zetaab There is a race condition here when the instance is stopping. That'd cause it to be deleted under the current logic since InstanceExistsByProviderID only checks if the instance isn't running.

@jhorwit2
Copy link
Contributor

/sig aws
/area cloudprovider

@zetaab
Copy link
Member Author

zetaab commented Feb 16, 2018

@jhorwit2 you are right https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html Do you have suggestions how I should solve this? Because if I add "stopping" state to this shutdown function as well. In theory that is not correct, but what you think? Also node might be deleted if instance is in pending state after it is started again. We could add stopping and pending to this shutdown function? In case of new node, pending state does not have any effect on current working solution because node is not added cluster yet.

@zetaab
Copy link
Member Author

zetaab commented Feb 16, 2018

hmm actually I got better idea:

currently there are two possibilities how nodes are deleted: using nodelifecycle controller or cloud controller.

Nodelifecycle controller: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1291-L1306 this does not care about instance state at all? So when using nodelifecycle controller instance is not deleted currently in AWS. If we use this cloudprovider shutdown function as-is. And node taint is added correctly, ONLY when it is safe to detach volumes.

Cloud controller manager:
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1308-L1338 imo here we should check is the instance state != shutting-down or state != terminated. Then it should work correctly: shutdown taint is added only when it is safe to detach volume immediately (state stopped). Otherwise node is not just deleted from cluster (if the machine is not going to be terminated).

What you think, can we modify InstanceExistsByProviderID to work little bit different way in order to keep nodes in cluster? The current solution is not perfect because if instance is in state rebooting for long time, node is deleted from cluster.

Which solution is better? 1) add pending and stopping states to shutdown 2) modify InstanceExistsByProviderID (we should take care that we will modify this in all cloudproviders then). The spec is going to be changed from InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is running. to InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is not going to be removed.

return false, err
}
if len(instances) == 0 {
return false, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply that we took so long that AWS removed the instance from the list? I agree it is unexpected, but maybe glog.Warning and return true, nil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 we should return true here with a warning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return true here it means that instance is not removed from cluster. Instead shutdown taint is added. But the instance is deleted for real from aws and imo correct one is return false here. By returning false code will continue checks and finally remove node from cluster


state := instances[0].State.Name
// valid state for detaching volumes is stopped
if *state == "stopped" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think also terminated.

Might also be better to use the constants in the ec2 sdk: InstanceStateNameStopped and InstanceStateNameTerminated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinsb can you read my longer comment above? The thing here is that with nodelifecycle controller this thing is going to work pretty well. Nodes are not deleted from cluster if those are in some weird state like pending. However, in future with cloud controller manager this thing is not going to work. We have two interfaces where we should check all possible states of instance. Otherwise those are deleted from cluster. Currently we check only stopped, terminated(? i will add it) and active. We should add states shutting-down stopping pending and rebooting instance states to InstanceShutdownByProviderID or InstanceExistsByProviderID. Imo exists is correct place for these.

return false, fmt.Errorf("multiple instances found for instance: %s", instanceID)
}

state := instances[0].State.Name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe aws.StringValue(instances[0].State.Name) to avoid a panic if name is not set (though we'll still panic if state is nil, but ...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer checking state before doing this

@justinsb
Copy link
Member

I'm not very familiar with these controller - it seems like they are new. I'm digging in to what they do, but some comments on the code in the meantime.

@justinsb
Copy link
Member

Also, why are you stopping your instances on AWS, vs just terminating them?

@justinsb
Copy link
Member

Ah - this is brand new. Let's review alongside the reapplication of #59323. I put some comments on the API here: #59323 (comment)

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 16, 2018
@sjenning
Copy link
Contributor

fyi @eparis

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2018
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2018
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 24, 2018
@zetaab
Copy link
Member Author

zetaab commented Aug 29, 2018

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 29, 2018
@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Aug 29, 2018
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 29, 2018
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 29, 2018
@zetaab
Copy link
Member Author

zetaab commented Aug 29, 2018

/milestone v1.12

@k8s-ci-robot
Copy link
Contributor

@zetaab: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zetaab
Copy link
Member Author

zetaab commented Aug 29, 2018

@justinsb can you review this and add milestone 1.12. This is needed to #67977

@yastij
Copy link
Member

yastij commented Aug 29, 2018

/retest

@dims
Copy link
Member

dims commented Aug 29, 2018

@kubernetes/sig-aws-misc

if instance.State != nil {
state := aws.StringValue(instance.State.Name)
// valid state for detaching volumes
if state == "stopped" || state == "terminated" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use InstanceStateNameStopped and InstanceStateNameTerminated constants instead?

@gnufied
Copy link
Member

gnufied commented Aug 29, 2018

mostly looks good to me. one minor change and we are good to go.

@gnufied
Copy link
Member

gnufied commented Aug 29, 2018

@saad-ali @childsb can we add this to 1.12 milestone please?

changes according what was asked

use string

do not delete instance if it is in any other state than running

use constants

fix
@gnufied
Copy link
Member

gnufied commented Aug 30, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 30, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 30, 2018
@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 67368, 59930, 68074). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit cfdefff into kubernetes:master Aug 30, 2018
@k8s-ci-robot
Copy link
Contributor

@zetaab: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws a68cbd6 link /test pull-kubernetes-e2e-kops-aws
pull-kubernetes-kubemark-e2e-gce-big a68cbd6 link /test pull-kubernetes-kubemark-e2e-gce-big

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement InstanceShutdownByProviderID to AWS